
1) 【一句话结论】通过SQL查询获取股票与市场指数的月度收益率数据,并采用线性插值处理缺失值、Z-score法识别异常值,为协整分析提供基础数据。
2) 【原理/概念讲解】
首先,月度收益率计算:采用对数收益率((r_t = \ln(P_t/P_{t-1}))),因对数收益率具有线性化、无漂移、方差稳定的特性,适合时间序列分析。
其次,协整关系:是两个非平稳时间序列的线性组合存在平稳性(如Engle-Granger两步法的基础),需先计算收益率(平稳序列)再检验。
缺失值处理:月度数据若某月无数据,采用线性插值(基于前后有效月份的收益率拟合直线插值),保留时间趋势;异常值处理:采用Z-score法(计算观测值与均值的偏差除以标准差,绝对值>3视为异常),剔除显著偏离的异常值。
3) 【对比与适用场景】
| 方法/类别 | 定义 | 特性 | 使用场景 | 注意点 |
|---|---|---|---|---|
| 缺失值处理(线性插值) | 基于前后有效数据点拟合直线,插值缺失值 | 保留时间趋势,适合连续缺失 | 数据缺失较少、时间序列趋势稳定 | 可能引入趋势偏差 |
| 缺失值处理(均值填充) | 用历史均值填充缺失值 | 简单直接,忽略时间趋势 | 缺失值占比低、数据波动小 | 可能平滑真实波动 |
| 异常值处理(Z-score法) | 计算观测值与均值的偏差除以标准差,绝对值>3为异常 | 简单直观,依赖正态分布假设 | 数据近似正态分布 | 标准差波动大时,阈值需调整 |
| 异常值处理(IQR法) | 用四分位数范围((Q3 - Q1))乘以1.5,判断是否在区间外 | 不依赖正态分布 | 数据分布偏斜 | 对极端值敏感 |
4) 【示例】
假设表结构:
stock_data(股票日数据):date(日期)、stock_code(股票代码)、close_price(收盘价)market_index(市场指数日数据):date(日期)、index_code(指数代码)、close_price(收盘价)步骤1:计算股票月度收益率
WITH stock_monthly AS (
SELECT
date,
stock_code,
close_price,
LAG(close_price, 1) OVER (PARTITION BY stock_code ORDER BY date) AS prev_close
FROM stock_data
),
stock_returns AS (
SELECT
date,
stock_code,
close_price,
prev_close,
CASE
WHEN prev_close IS NULL THEN NULL
ELSE LOG(close_price / prev_close)
END AS stock_return
FROM stock_monthly
)
SELECT
date,
stock_code,
stock_return
FROM stock_returns
WHERE date >= DATEADD(month, -12, GETDATE()) -- 近一年
ORDER BY date, stock_code;
步骤2:计算市场指数月度收益率
WITH market_monthly AS (
SELECT
date,
index_code,
close_price,
LAG(close_price, 1) OVER (PARTITION BY index_code ORDER BY date) AS prev_close
FROM market_index
),
market_returns AS (
SELECT
date,
index_code,
close_price,
prev_close,
CASE
WHEN prev_close IS NULL THEN NULL
ELSE LOG(close_price / prev_close)
END AS market_return
FROM market_monthly
)
SELECT
date,
index_code,
market_return
FROM market_returns
WHERE date >= DATEADD(month, -12, GETDATE())
ORDER BY date, index_code;
步骤3:合并数据并处理缺失值(线性插值)
WITH stock_returns_clean AS (
SELECT
date,
stock_code,
stock_return,
ROW_NUMBER() OVER (PARTITION BY stock_code ORDER BY date) AS rn
FROM stock_returns
),
market_returns_clean AS (
SELECT
date,
index_code,
market_return,
ROW_NUMBER() OVER (PARTITION BY index_code ORDER BY date) AS rn
FROM market_returns
),
stock_interpolated AS (
SELECT
s.date,
s.stock_code,
s.stock_return,
COALESCE(s.stock_return,
(SELECT stock_return FROM stock_returns_clean sr
WHERE sr.stock_code = s.stock_code
AND sr.rn = s.rn - 1) -- 前值
) AS interpolated_stock_return
FROM stock_returns_clean s
),
market_interpolated AS (
SELECT
m.date,
m.index_code,
m.market_return,
COALESCE(m.market_return,
(SELECT market_return FROM market_returns_clean mr
WHERE mr.index_code = m.index_code
AND mr.rn = m.rn - 1) -- 前值
) AS interpolated_market_return
FROM market_returns_clean m
)
SELECT
s.date,
s.stock_code,
s.interpolated_stock_return,
m.interpolated_market_return
FROM stock_interpolated s
JOIN market_interpolated m ON s.date = m.date;
步骤4:异常值处理(Z-score法)
WITH stock_returns_clean AS (
SELECT
date,
stock_code,
stock_return,
AVG(stock_return) OVER (PARTITION BY stock_code) AS mean_return,
STD(stock_return) OVER (PARTITION BY stock_code) AS std_return,
CASE
WHEN stock_return IS NULL THEN NULL
ELSE ABS(stock_return - mean_return) / std_return
END AS z_score
FROM stock_returns
),
stock_no_outliers AS (
SELECT
date,
stock_code,
stock_return
FROM stock_returns_clean
WHERE z_score <= 3 -- 剔除Z-score>3的异常值
)
SELECT
date,
stock_code,
stock_return
FROM stock_no_outliers
ORDER BY date, stock_code;
5) 【面试口播版答案】
“首先,我们需要通过SQL查询获取股票与市场指数的月度收益率数据。计算收益率时,我们采用对数收益率((r_t = \ln(P_t/P_{t-1}))),因为对数收益率能更好地捕捉价格变化,且满足时间序列分析的要求。具体来说,先从股票和指数的日收盘价表中,通过LAG函数获取前一个月的收盘价,然后计算对数收益率。接着,处理缺失值,对于月度数据,若某月无数据,我们采用线性插值法,基于前后有效月份的收益率进行插值,以保留时间趋势。对于异常值,我们采用Z-score法,计算每个股票月度收益率的均值和标准差,将Z-score超过3的视为异常值并剔除,确保数据质量。最后,将处理后的股票与市场指数月度收益率数据合并,用于后续的协整关系分析。”
6) 【追问清单】
7) 【常见坑/雷区】