Welcome to the course on Statistics for Finance and Investment. This course is designed primarily for those with a focus on investing and finance. To succeed in the course, students are expected to bring three prerequisites:
Common sense and curiosity.
A basic understanding of accounting.
A fundamental grasp of statistics.
Statistics plays an essential role in our everyday decision-making, especially in finance and healthcare. Understanding statistics is crucial as it allows us to interpret data accurately:
Statistics influence health policies (e.g., vaccine efficacy rates).
Insurance rates are based on statistical data about disease prevalence.
Polling data during elections illustrates how sample selection impacts outcomes.
The course aims to enhance your statistical knowledge for practical applications in finance. By the end of the 15 sessions, participants should be able to:
Understand various statistical tools and when to use them.
Recognize misapplication of statistics and ask informed questions.
Apply statistical reasoning to financial decision-making.
The application of statistics in investment allows for better decision-making, exemplified by the concept of Moneyball. This term refers to using statistical analysis to gain competitive advantages, particularly in contexts normally driven by intuition.
“Statistics allows for better decisions than gut feelings or rules of thumb.”
By participating in this course, you will gain a robust understanding of statistical tools that apply directly to finance and investment contexts. Engaging with these concepts will enable students to navigate data-driven environments more effectively, thus improving their decision-making capabilities.
Welcome to the first session of a 15-session statistics class. This session will lay the groundwork for the course, discussing essential concepts in statistics, its significance, and the various components that compose the field.
Statistics is often described as the "science of data". It is a discipline designed to:
Collect data
Record data
Analyze data
Depict data
Convert data into information
It is crucial to understand and interpret data effectively, especially in an age where we are surrounded by vast amounts of information that can sometimes be contradictory or misleading.
Data can be defined in several ways:
Qualitative Data: Categorical information (e.g., colors, names).
Quantitative Data: Numerical information, further divided into:
Continuous Data: Can take any value in a range (e.g., height).
Discrete Data: Takes on specific values, usually integers (e.g., basketball scores).
Historically, data has been vital for decision-making. However, with modern advancements, we face challenges such as data overload, where the excessive amount of data can lead to confusion rather than clarity.
Statistics encompasses various activities, including:
Collection: Gathering data through observations or surveys.
Sampling: Selecting a subset of data to represent a larger population.
It is essential to ensure that sampling is unbiased and correctly executed to avoid erroneous conclusions. Two critical concepts to understand in this phase are:
Bias: A biased sample leads to inaccurate representations of the population.
Noise/Error: Every sample includes an element of error; understanding the standard error is crucial.
After collecting data, the next step is summarizing it using descriptive statistics, which include:
Measures of location (e.g., mean, median).
Measures of dispersion (e.g., variance, standard deviation).
Measures of skewness (e.g., identifying asymmetry in distributions).
Visual representation of data, such as histograms, enables better understanding and insights. A histogram counts the number of observations within specified ranges. The overall understanding of data distributions can provide crucial insights:
Normal Distribution: A common symmetric distribution.
Other Distributions: Distributions can also be asymmetric, depending on data behavior.
Exploring relationships between variables involves:
Correlation: Examining whether two variables move together.
Causation: Determining if one variable influences another.
Prediction: Using established relationships to forecast future outcomes.
Probability measures the likelihood of events:
Discrete Events: E.g., whether a company will survive or go bankrupt.
Continuous Variables: E.g., predicting whether profits exceed a certain threshold.
Tools such as probits and logit models can help estimate probabilities based on observable data.
The understanding of statistics is essential for informed decision-making in various fields, including finance, healthcare, and policy-making.
The remaining sessions will build on these foundational elements and enhance your understanding of statistics and its applications in the real world.
In this session of the statistics class, we will discuss the concept of populations and samples, with a particular focus on their relevance to data analysis.
A population refers to the entire universe of instances of an object or phenomenon we intend to study. For example, if we want to analyze how businesses behave in crises, the population includes every business around the world—large or small, public or private, regardless of geographic location.
A sample is a subset of the population. Due to practical limitations, we often rely on samples to draw conclusions about the population. For instance, during the COVID-19 pandemic, one might analyze a sample of publicly traded companies rather than all businesses.
There are two main approaches to sampling:
Time series sampling: This method involves collecting data over time. For instance, analyzing annual returns on the U.S. stock market since 1871 would amount to a time series sample. If only data from a specific period (e.g., 2000 to 2020) is used, it still qualifies as a sample.
Cross-sectional sampling: This method looks at a snapshot of a population at a specific point in time. For example, we might examine all publicly traded companies with a market cap exceeding $10 million as of a given date; restricting the sample to larger companies can also make the underlying data more reliable.
Sampling is often necessary due to:
Practicality: Analyzing an entire population may be infeasible.
Cost: Collecting data from the entire population can be prohibitively expensive.
Time: Quick results are often needed (e.g., political polls before elections).
Sampling can be divided into probability-based and non-probability-based methods.
Probability-based sampling: Samples are chosen at random, for example selecting 500 companies at random from 9,000 publicly traded companies.
Non-probability-based sampling: The researcher uses specific criteria to select the sample, such as picking the 500 largest market cap companies (similar to the S&P 500).
Within probability-based sampling, several variants exist:
Simple random sampling: Every observation has an equal chance of being selected. However, this might lead to an unbalanced representation of different sectors.
Stratified sampling: The population is divided into strata (groups), and samples are taken from each stratum. This method ensures that various segments are adequately represented.
Cluster sampling: The population is divided into clusters based on a characteristic (e.g., alphabetical order of companies), and entire clusters are sampled.
Bias in sampling can originate from:
Exclusion: Members of the population are not represented in the sample (e.g., no phone access).
Self-Selection: Observations from individuals who opt into a sample may skew results (e.g., those seeking healthcare).
Non-Response: Individuals who do not respond to surveys may differ from those who do, skewing the sample.
Survivorship Bias: Only examining entities that survive over time may lead to erroneous conclusions.
Estimation from samples includes sampling noise or error, which is a natural part of sampling due to variability. For example, when tossing a fair coin, the outcomes over a finite number of tosses will not perfectly reflect the true 50/50 odds.
Observations should ideally be independent (current event unaffected by previous events) and identically distributed (drawn from the same probability distribution).
The Law of Large Numbers states that as the sample size increases, the sample average approaches the population average:
$$\lim_{n \to \infty} \bar{X}_n = \mu$$
where X̄n is the sample mean and μ is the population mean.
The Central Limit Theorem states that, regardless of the population distribution, the sampling distribution of the sample mean tends toward a normal distribution as the sample size increases:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
where σ is the population standard deviation and n is the sample size; the standard deviation of the sample mean, σ/√n, is the standard error.
Chebyshev's inequality allows for statements about distributions that are not normal, asserting that:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
for any k > 1, indicating that the fraction of observations that lie within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$.
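To make these results concrete, here is a minimal Python sketch (using only numpy) that draws repeated samples from a skewed, non-normal population and checks that the sample means cluster around the population mean, that their spread matches σ/√n, and that Chebyshev's bound holds. All names and parameters are illustrative choices, not data from the course.

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed, non-normal "population": exponential with mean 2
population_mean = 2.0
population_sd = 2.0          # for an exponential, the sd equals the mean

n = 100                      # sample size
num_samples = 20_000         # number of repeated samples

# Draw many samples and compute their means
samples = rng.exponential(scale=2.0, size=(num_samples, n))
sample_means = samples.mean(axis=1)

# Law of Large Numbers: the average of the sample means is close to mu
print("average of sample means:", sample_means.mean())        # ~2.0

# Central Limit Theorem: spread of the sample means ~ sigma / sqrt(n)
print("sd of sample means:", sample_means.std(ddof=1))         # ~0.2
print("sigma / sqrt(n):   ", population_sd / np.sqrt(n))

# Chebyshev: at most 1/k^2 of the observations lie k or more sds from the mean
k = 2.0
frac_outside = np.mean(np.abs(samples - population_mean) >= k * population_sd)
print(f"fraction beyond {k} sd: {frac_outside:.3f} (bound: {1 / k**2:.3f})")
```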
In statistics, sampling is a fundamental concept that allows researchers to make inferences about populations. The choice between sampling methods, awareness of biases, and understanding of relevant statistical laws such as the Law of Large Numbers and the Central Limit Theorem are crucial in conducting reliable and valid analysis.
This session focuses on sampling questions and issues frequently encountered in finance and investing. A primary aspect is understanding market indices, which serve as a sample of publicly traded stocks. It is essential to note that these samples are not random; they are selected based on specific criteria.
Market indices represent a sample of stocks that help gauge overall market performance. Common indices include:
Dow Jones Industrial Average (DJIA): First published in 1896; it now includes 30 large U.S. companies.
S&P 500: Comprises 500 of the largest market capitalization companies.
NYSE Composite: Represents all listed securities on the New York Stock Exchange.
NASDAQ Composite: Includes major stocks traded on the NASDAQ.
Wilshire 5000: A broader index designed to capture essentially all publicly traded U.S. stocks (roughly 5,000 when it was created).
Indices are chosen based on criteria such as market capitalization. For instance:
Market Cap: The S&P 500 is made up of the 500 largest market cap companies.
Subjective Selection: Some indices like the DJIA are influenced by subjective decisions made by index editors.
How stocks are included and their respective weights can significantly affect the performance representation of the index:
Equally Weighted Indices: All stocks receive the same weight.
Market Cap Weighted Indices: Stocks are weighted according to their market capitalization.
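The weighting scheme can be expressed directly. The sketch below (with made-up numbers, not actual index data) computes one period's return on the same three stocks under equal weighting and market-cap weighting.

```python
import numpy as np

# Hypothetical data: three stocks' market caps ($bn) and one-period returns
market_caps = np.array([2000.0, 500.0, 50.0])
returns = np.array([0.10, -0.02, 0.25])

# Equally weighted: every stock gets weight 1/N
equal_weights = np.ones_like(returns) / len(returns)
equal_weighted_return = np.dot(equal_weights, returns)

# Market-cap weighted: weights proportional to market capitalization
cap_weights = market_caps / market_caps.sum()
cap_weighted_return = np.dot(cap_weights, returns)

print(f"Equally weighted index return: {equal_weighted_return:.2%}")
print(f"Cap-weighted index return:     {cap_weighted_return:.2%}")
# Large-cap moves dominate the cap-weighted index, as with the S&P 500
```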
The diversity of indices can lead to different performance trends. For example, in 2020, the performance divergence among the DJIA, S&P 500, and NASDAQ was evident due to sector concentration, particularly in technology stocks.
Passive investing has become increasingly prevalent through index funds and ETFs:
The first index fund was established in the 1970s by Vanguard.
ETFs have proliferated in the past two decades.
Many index funds and ETFs claim to track indices but often have distinct compositions due to sampling methods:
Example: An energy ETF may hold a different set of energy stocks than another ETF, leading to varying returns despite both claiming to track energy stocks.
Investors often focus on specific time periods, ignoring earlier data due to reliability concerns. Relying only on a particular era of historical data (e.g., post-1926 for the U.S. stock market) may distort broader market understanding.
The Price-to-Earnings (PE) ratio is a key metric used to classify stocks:
$$PE = \frac{\text{Price}}{\text{Earnings}}$$
Variants include the normalized PE (based on a moving average of earnings) and the CAPE ratio, which uses a ten-year average of inflation-adjusted earnings.
When using historical averages for future predictions, be cognizant of the shifting market structures that may render past data less relevant—historical averages should not be seen as definitive future indicators.
The small cap premium indicates small companies may outperform larger companies. Historical evidence up to 1980 suggested this; however, current trends need careful analysis as past performance may not predict future results.
Historically:
From 1927 to 1980, small cap stocks outperformed.
Since then, assessment of this premium requires cautious evaluation based on current market conditions.
Many investment strategies are back-tested against historical data. Common questions to assess these strategies include:
What universe of stocks was used?
How were stocks classified and what time period was considered?
Did the analysis account for negative earnings, survivorship, and timing biases?
Event studies analyze stock performance surrounding specific events (e.g., mergers). Considerations include:
Sample Creation: How were events selected?
Timing: Was the announcement date accurate or influenced by information leakage?
Sampling is an integral part of finance and investing. Understanding the nuances of how samples are constructed and their limitations is crucial for making informed investment decisions. Careful consideration of indices, investment vehicles, historical data, and empirical findings will lead to more robust investment strategies.
Summary statistics provide a compact representation of data, enabling better comprehension and communication of numerical information. This session covers four primary categories of descriptive statistics:
Measures of Centrality
Measures of Dispersion
Measures of Symmetry
Measures of Extremes
These measures describe the central value of a data set.
The average, or mean, is calculated as follows:
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where xi is each observation and n is the total number of observations.
A weighted average adjusts the significance of certain values:
$$\text{Weighted Average} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
where wi is the weight assigned to each observation.
The median is the middle value when data is sorted. For an odd number of observations (n):
$$\text{Median} = x_{\frac{n+1}{2}}$$
For an even number of observations:
$$\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}$$
The mode is the most frequently occurring value in the dataset.
These measures indicate the spread of data around the central value.
The range is simply the difference between the maximum and minimum values:
Range = Max(X) − Min(X)
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1):
IQR = Q3 − Q1
Variance quantifies the degree of dispersion from the mean: For a population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
For a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The standard deviation is the square root of variance:
$$\sigma = \sqrt{\sigma^2} \quad \text{(population)}$$
$$s = \sqrt{s^2} \quad \text{(sample)}$$
The coefficient of variation (CV) is a standardized measure of dispersion, calculated as:
$$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}}$$
Measures of symmetry assess the skewness of a data distribution.
Skewness quantifies the asymmetry of the distribution:
Positively skewed: Higher tail on the right.
Negatively skewed: Higher tail on the left.
These measures relate to the occurrence of outliers or extreme values.
Kurtosis measures the "tailedness" of the distribution:
$$\text{Excess Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n \cdot s^4} - 3$$
(the raw kurtosis of a normal distribution is 3, so subtracting 3 centers the measure at zero).
Positive excess kurtosis indicates fatter tails (leptokurtic).
Negative excess kurtosis indicates thinner tails (platykurtic).
In summary, understanding these measures enables better statistical analysis and data visualization. Key measures include:
Centrality: Mean, Median, Mode
Dispersion: Range, IQR, Variance, Standard Deviation, Coefficient of Variation
Symmetry: Skewness
Extremes: Kurtosis
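All of these measures can be computed in a few lines; the sketch below uses numpy and scipy.stats on a small, made-up sample of annual returns (the numbers are purely illustrative).

```python
import numpy as np
from scipy import stats

# Illustrative sample of annual returns (not actual market data)
x = np.array([0.12, -0.05, 0.30, 0.08, -0.18, 0.22, 0.05, 0.15, -0.02, 0.10])

# Centrality
mean = x.mean()
median = np.median(x)

# Dispersion
data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
sample_sd = x.std(ddof=1)            # n - 1 in the denominator
cv = sample_sd / mean

# Symmetry and extremes
skewness = stats.skew(x, bias=False)
excess_kurtosis = stats.kurtosis(x, fisher=True, bias=False)  # normal -> 0

print(f"mean={mean:.4f}  median={median:.4f}")
print(f"range={data_range:.4f}  IQR={iqr:.4f}  sd={sample_sd:.4f}  CV={cv:.4f}")
print(f"skewness={skewness:.4f}  excess kurtosis={excess_kurtosis:.4f}")
```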
In this session, we will discuss the application of descriptive statistics to real-world financial data. We will analyze returns on three asset classes:
U.S. Stocks
10-Year Treasury Bonds
Treasury Bills
For Treasury Bills, the return is simply the T-Bill rate for the year, as there is no price-change component.
The total return on Treasury Bonds includes two components:
Coupon payments
Changes in bond prices during the year
Total return can be expressed as:
Total Return = Coupon Payment + Price Change
For stocks, returns are calculated using:
Dividends received
Changes in stock price
The total stock return is:
Total Stock Return = Dividends + Price Change
Calculating average returns over the period from 1928 to 2020:
Average Return on Stocks:
$$\text{Average}_{stocks} = \frac{\sum_{i=1}^{93} \text{Return}_{stocks,i}}{93} = 11.64\%$$
Average Return on Treasury Bills:
$$\text{Average}_{T-Bills} = \frac{\sum_{i=1}^{93} \text{Return}_{T-Bills,i}}{93} = 3.36\%$$
Average Return on Treasury Bonds:
$$\text{Average}_{T-Bonds} = \frac{\sum_{i=1}^{93} \text{Return}_{T-Bonds,i}}{93} = 5.21\%$$
To assess risk, we calculate the standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{93} (x_i - \mu)^2}{n-1}}$$
Where μ is the average return and xi is each annual return.
For stocks:
σstocks = 19.49%
For Treasury Bonds and Bills, the same formula can be applied, showing their lower volatility compared to stocks.
To calculate the standard error:
$$SE = \frac{\sigma}{\sqrt{n}}$$
For stocks:
$$SE_{stocks} = \frac{19.49}{\sqrt{93}} \approx 2.02\%$$
The confidence intervals can be calculated as follows:
For approximately 68% confidence (plus or minus one standard error):
[μ − SE, μ + SE] = [11.64 − 2.02, 11.64 + 2.02] ≈ [9.62, 13.66]
For 95% confidence:
[μ − 2 ⋅ SE, μ + 2 ⋅ SE] = [11.64 − 4.04, 11.64 + 4.04] ≈ [7.60, 15.68]
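These calculations can be scripted directly; the minimal sketch below uses a short placeholder array of annual returns rather than the actual 1928-2020 series (swap in the 93 annual observations to reproduce the numbers above).

```python
import numpy as np

# Placeholder annual returns (fractions); replace with the 1928-2020 data
returns = np.array([0.12, -0.08, 0.25, 0.18, -0.30, 0.22, 0.07, 0.15])

n = len(returns)
mean = returns.mean()
sd = returns.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(n)                # standard error of the mean

ci_68 = (mean - se, mean + se)      # roughly 68% confidence
ci_95 = (mean - 2 * se, mean + 2 * se)

print(f"mean={mean:.2%}  sd={sd:.2%}  se={se:.2%}")
print(f"~68% CI: [{ci_68[0]:.2%}, {ci_68[1]:.2%}]")
print(f"~95% CI: [{ci_95[0]:.2%}, {ci_95[1]:.2%}]")
```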
The median return for stocks is 14.22%, indicating a degree of skewness. Stocks exhibit a slight negative skewness leading to a median higher than the average return.
For Treasury Bonds and Bills, positive skewness is observed.
Kurtosis for normal distribution: 3.
Stocks exhibit kurtosis close to 3 despite a slight negative skew.
Higher kurtosis in Treasury Bonds indicates a higher likelihood of extreme values.
PE Ratio = $\frac{\text{Stock Price}}{\text{Earnings per Share}}$
Three measures of PE ratios:
Current PE (most recent fiscal year)
Trailing PE (last four quarters)
Forward PE (expected earnings next four quarters)
Distribution characteristics show a positive skew, where extreme values affect the average more than the median.
Cost of capital reflects the required return on investments, calculated for all publicly traded companies.
Differences in average and median cost of capital indicate varying skewness across regions.
The analysis of descriptive statistics provides insight into the performance and risk levels of different asset classes. Stocks, despite high returns, exhibit greater volatility, while Treasury securities show stability.
In this session, we transition from data descriptives to understanding distributions. Data descriptives focus on summary statistics, while distributions offer a visual and analytical framework to represent data characteristics.
Data descriptives include the following measures:
Measures of Centrality
Measures of Dispersion
Measures of Skewness
Measures of Kurtosis
While numerical summaries are informative, visual representations like histograms and bar charts are often more compelling and easier to interpret.
Bar Chart: Used for discrete data. Displays frequency of categories.
Histogram: Used for continuous data. Displays frequency of data intervals (bins).
For instance, if we analyze Price-to-Earnings (P/E) ratios:
Bar charts can display average P/E ratios by region.
Histograms can show the frequency of P/E ratios within defined ranges (e.g., 0 to 4, 4 to 8).
Once visualized, identify which statistical distribution best fits the data. Key characteristics include:
Symmetry
Skewness
Tail behavior (fat tails vs. thin tails)
The normal distribution is defined as:
X ∼ N(μ, σ²)
where μ is the mean and σ is the standard deviation. Properties include:
Symmetric around the mean.
Approximately 68% of observations fall within one standard deviation, 95% within two, and 99.7% within three standard deviations.
Criterion for normality: Check the percentage of observations outside three standard deviations.
When considering small sample sizes or specific data characteristics, alternative distributions may be more appropriate.
T-distribution: Similar to the normal distribution but with:
Lower peak
Wider (fatter) tails
Triangular distribution: A symmetric distribution often used for bounded data, characterized by a pronounced peak and finite upper and lower bounds.
Uniform distribution: A bounded distribution where all outcomes are equally likely between two defined bounds.
If the data is not symmetric:
Use the Minimum Extreme Value Distribution for distributions with a long tail on the negative side.
Use the Log-Normal Distribution for data such as stock prices, where extreme values can be significantly positive.
The kurtosis of a distribution provides insights into tail behavior:
Normal Distribution: Kurtosis = 3
Excess Kurtosis:
Pearson Kurtosis = Kurtosis − 3
Pearson < 0: Thin tails (platykurtic)
Pearson = 0: Similar to normal (mesokurtic)
Pearson > 0: Fat tails (leptokurtic)
Follow this flowchart for selecting appropriate distributions:
Is the data continuous or discrete?
Is the data symmetric or asymmetric?
If symmetric, what are the tail characteristics (thin/fat)?
If asymmetric, which direction is the skew?
Common distributions include:
Symmetric Continuous: Normal, T, Triangular, Uniform
Asymmetric Continuous: Log-normal, Minimum Extreme Value
Discrete: Binomial, Negative Binomial, Hypergeometric
Understanding the characteristics of your data and the appropriate distribution models is crucial for effective data analysis. This session covered a range of distributions and guidance for selection, which will aid in the interpretation of statistical findings.
This document provides an overview of the importance of data distributions in finance, particularly focusing on the normal distribution. While the normal distribution simplifies analysis and modeling, it can lead to significant misjudgments if the underlying data does not conform to this distribution.
Simplifies theoretical and practical analysis.
Allows for the description of large datasets with just two parameters: the expected value (mean) and standard deviation (σ).
The normal distribution has the probability density function (PDF):
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where:
x is the variable,
μ is the mean,
σ is the standard deviation.
Assuming all data is normally distributed can lead to significant errors, especially in finance where extreme outcomes (fat tails) can occur more frequently than predicted by the normal distribution.
Annual returns on Disney from 1962 to 2021 were analyzed. A histogram of returns illustrates the distribution visually.
The histogram shows annual returns ranging from about -20% to over +120%, evidence of some extreme yearly outcomes.
A Quantile-Quantile (QQ) plot compares the quantiles of the empirical data to the quantiles of a normal distribution. Ideally, points should lie on the reference line if the data is normally distributed.
Conduct various statistical tests for normality, such as:
Shapiro-Wilk Test
Kolmogorov-Smirnov Test
For Disney, 7 out of 9 tests did not reject the normality hypothesis.
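The tests mentioned above are available in scipy; the sketch below runs a Shapiro-Wilk and a Kolmogorov-Smirnov test on a placeholder return series (substitute the actual annual returns to reproduce the analysis). Note that fitting the normal's mean and standard deviation from the same sample, as done here, is a common simplification.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder return series; replace with the actual annual returns
returns = rng.normal(loc=0.10, scale=0.25, size=60)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw_stat, sw_p = stats.shapiro(returns)

# Kolmogorov-Smirnov test against a normal with the sample's own mean/sd
ks_stat, ks_p = stats.kstest(returns, "norm",
                             args=(returns.mean(), returns.std(ddof=1)))

print(f"Shapiro-Wilk:       stat={sw_stat:.3f}, p={sw_p:.3f}")
print(f"Kolmogorov-Smirnov: stat={ks_stat:.3f}, p={ks_p:.3f}")
# p-values above the chosen significance level fail to reject normality
```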
Annual returns on Apple from 1981 to 2020 were examined, showing a broader distribution with extreme values.
Annual returns are more likely to follow a normal distribution compared to daily returns.
Daily returns show higher peaks and fatter tails, leading to rejection of normality much more frequently.
If investors are misled into assuming normality, risk management systems may fail to protect against extreme outcomes:
Price behaviors often do not fit normal distributions.
If true distributions have fatter tails (higher chances of outliers), failure to account for extreme events could be disastrous.
Log prices can be more normally distributed due to the transformation:
Log Price = ln (Price)
However, using logarithmic transformations does not guarantee normality. For example, post-logarithmic, Apple’s stock prices still showed signs of deviation from normality.
Always analyze the data through histograms.
Use QQ plots and statistical tests for normality.
Be cautious when assuming normal distributions, as misestimations can lead to significant risk exposure.
Consider alternative distributions if normality is rejected, or use non-parametric methods.
Understanding the underlying data distributions in finance is vital for accurate risk assessment and management. Further exploration into empirical distributions and non-normality in financial datasets is encouraged.
In previous sessions, we explored single data variables and their measures of centrality, dispersion, and skewness. In this session, we shift our focus to analyzing relationships between two data variables. This can be useful in evaluating whether variables such as price-earnings ratios and interest rates, or earnings growth and GDP growth, exhibit any linkage.
When examining two data series, we seek to determine if they:
Move together in the same direction.
Move in different directions (inversely).
Are unrelated.
If a relationship exists, we can further investigate potential lead or lag effects, in which one variable influences the timing of another.
It is critical to differentiate between correlation and causation:
Correlation indicates a statistical relationship where two variables move together.
Causation implies that one variable directly influences the other, which requires further testing to establish.
It’s important to note that correlation does not imply causation; variables may appear correlated due to random chance or the influence of a third variable.
To visually assess the relationship between two variables, we create a scatter plot:
Let x be the independent variable and y be the dependent variable.
Plot each pair of observations (xi, yi) on a graph.
Scatter plots allow for a visual assessment, revealing potential correlations between the two variables.
To quantify the strength of the relationship, we compute the correlation coefficient r. The most widely used is the Pearson correlation coefficient, calculated as:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}$$
where:
xi and yi are individual sample points,
x̄ and ȳ are the means of the x and y samples.
The correlation coefficient r ranges from -1 to 1:
r = 1 indicates a perfect positive correlation.
r = − 1 indicates a perfect negative correlation.
r = 0 indicates no correlation.
Another measure of the relationship between two variables is covariance, defined as:
$$\text{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$$
Unlike the correlation coefficient, covariance is not standardized, meaning its values can vary widely based on the scale of the variables. Thus, it is harder to interpret directly.
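As a quick check of these formulas, the sketch below computes the covariance and the Pearson correlation for a small pair of illustrative series (the numbers are made up).

```python
import numpy as np

# Illustrative paired observations (e.g., monthly returns on two assets)
x = np.array([0.02, -0.01, 0.04, 0.03, -0.02, 0.05])
y = np.array([0.03, -0.02, 0.05, 0.02, -0.01, 0.06])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n   # population form
r = cov_xy / (x.std() * y.std())                        # Pearson correlation

print(f"covariance:  {cov_xy:.6f}")
print(f"correlation: {r:.4f}")
# np.corrcoef(x, y)[0, 1] returns the same correlation directly
```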
To analyze the relationship further, we can fit a regression line to the scatter plot. The most common approach is Ordinary Least Squares (OLS) regression, which minimizes the sum of the squared vertical distances (residuals) between the observed values and the regression line.
The regression equation takes the form:
y = β0 + β1x + ϵ
where:
β0 is the intercept,
β1 is the slope of the line,
ϵ represents the error term.
The slope β1 indicates the change in y per one unit change in x.
An important output from regression analysis is R2, which measures how well the independent variable explains the variation in the dependent variable. It quantifies the goodness of fit of the model:
$$R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}$$
where:
SSres is the sum of squares of residuals,
SStot is the total sum of squares.
An R2 of 0 indicates that the model does not explain any variability, while 1 indicates perfect explanatory power.
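The closed-form OLS estimates and the R2 calculation can be sketched in a few lines of numpy; the data below are simulated purely for illustration.

```python
import numpy as np

# Simulated data: y depends on x with noise (coefficients are made up)
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.8, size=50)

# OLS slope and intercept from the usual closed-form expressions
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Goodness of fit: R^2 = 1 - SS_res / SS_tot
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept={beta0:.3f}  slope={beta1:.3f}  R^2={r_squared:.3f}")
```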
When expanding to more than two variables, we can conduct multiple regressions. The model extends to:
y = β0 + β1x1 + β2x2 + … + βkxk + ϵ
where each xi is an independent variable.
Multicollinearity : Occurs when independent variables are correlated with each other, complicating coefficient interpretations. VIF (Variance Inflation Factor) can be used to measure multicollinearity.
Homoscedasticity : The residuals should display constant variance across all levels of the independent variable. Patterns in residuals may indicate a problem with the model.
Normality of Residuals : The residuals should be normally distributed for OLS to yield statistically valid inference.
If the relationship between variables is non-linear, non-linear regression techniques can be employed, or you can transform variables to achieve linearity. Transformations may include taking logarithms or polynomial terms.
The analysis of relationships between two data variables involves a systematic approach of visual exploration (scatter plots), quantifying relationships (correlation and covariance), fitting models (regression analysis), and checking the underlying assumptions. These methods assess predictability, essential for sound decision-making in finance and investing.
When applying these methods, remember:
Correlation does not imply causation.
Always check regression assumptions.
Data from the past may not consistently predict future outcomes.
In finance, it is essential to understand relationships between data, as these relationships can apply to various macro and micro-level variables.
Macro-level example : The relationship between interest rates and inflation. Conventional wisdom states that when inflation rises, interest rates also rise.
Micro-level example : The relationship between a company’s operating margins and revenue growth.
There are two main motivations for analyzing data relationships:
To understand data better for research and analysis.
To forecast one variable based on the other for investment purposes, aiming to make profitable predictions.
One key model in finance is the Capital Asset Pricing Model (CAPM), which assesses the risk of an investment relative to a market portfolio. The idea is that the risk of an asset should be evaluated in the context of a diversified portfolio.
To measure this risk, a regression analysis can be employed:
Ri = α + βRm + ϵ
Where:
Ri = Return of the asset
α = Intercept term
β = Sensitivity of the asset to market returns
Rm = Return of the market
ϵ = Error term
Before running regression, it is useful to understand covariance and correlation between variables.
Covariance measures how two variables move together:
Cov(X, Y) = E[(X − μX)(Y − μY)]
A positive covariance indicates that both variables tend to move in the same direction. For the Disney and S&P 500 data, the covariance calculated was 0.00337.
Correlation (Pearson’s r) standardizes covariance, bounded between -1 and 1:
$$r_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
For Disney and S&P 500, the correlation was calculated as 0.85666.
The t-statistic tests if the correlation is significantly different from zero:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
where n is the sample size.
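A minimal sketch of this test follows, using the reported correlation; the sample size below is a placeholder, since the number of observations in the Disney regression is not stated here.

```python
import numpy as np
from scipy import stats

r = 0.85666      # correlation reported for Disney vs. the S&P 500
n = 60           # placeholder; use the actual number of observations

t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# A large t (small p) rejects the hypothesis of zero correlation
```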
When analyzing the regression output:
Dependent variable : Return on Disney (Y)
Independent variable : Return on S&P 500 (X)
R-squared (R2): Measures goodness of fit, indicating the proportion of variance in the dependent variable explained by the independent variable.
$$R^2 = \frac{\text{SS}_{\text{regression}}}{\text{SS}_{\text{total}}}$$
Adjusted R-squared accounts for the number of predictors in the model, penalizing excessive variables.
Durbin-Watson statistic checks for autocorrelation in the residuals. A value close to 2 indicates no autocorrelation.
ANOVA (Analysis of Variance) assesses how well the independent variables explain the variance in the dependent variable.
F-statistic tests the overall significance of the regression model.
Understanding the coefficients from regression output:
Intercept : Indicates the expected return when all independent variables equal zero.
Slope coefficient : Indicates the change in the dependent variable for a one-unit change in the independent variable.
The significance can be tested using the t-statistic:
$$t = \frac{\text{coefficient}}{\text{standard error}}$$
The p-value gives the probability of observing a relationship at least this strong if the true coefficient were zero; a small p-value is evidence against pure chance.
Running a regression of earnings yield against short and long-term treasury rates can reveal relationships.
Analyzing a group of banks based on their price-to-book ratio, return on equity, and risk can help in identifying undervalued banks.
Assessing the residuals is vital to ensure that the assumptions of regression analysis (normality, homoscedasticity) hold true.
Regressions are powerful analytical tools but must be used carefully.
Avoid small samples with too many independent variables to reduce misleading results.
Understand the output to effectively use regression in financial analysis and prediction.
In this session, we will delve into regression analysis, a fundamental statistical method used to explain relationships between variables. The key components to understand are:
Dependent Variable: The variable we are trying to explain (e.g., price-to-earnings (P/E) ratios).
Independent Variables: The variables that explain the dependent variable (e.g., payout ratios, growth rates).
To choose independent variables for regression analysis, there are two main approaches:
Statistical Approach: Collect data and identify which independent variables correlate the most with the dependent variable.
Common Sense and Economic Theory: Use economic models and intuition to select independent variables that logically influence the dependent variable.
To illustrate, consider the P/E ratios across companies. The basic framework for the P/E ratio can be derived from the Gordon Growth Model:
$$P_0 = \frac{D_1}{r - g}$$
Where:
P0: Price of the stock
D1: Expected dividends next year
r: Cost of equity
g: Growth rate
Dividing by earnings per share (EPS), we derive:
$$P/E = \frac{D/E}{r - g}$$
Where:
D/E: The payout ratio (dividends divided by earnings), the percentage of earnings paid out as dividends.
(r): Cost of equity (risk).
(g): Growth rate of dividends.
Thus, the relationship can be summarized as:
P/E = f(Payout Ratio, Growth Rate, Cost of Equity)
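As a sanity check on this framework, a small helper (hypothetical, not from the course materials) can compute the P/E implied by a given payout ratio, cost of equity, and growth rate under the stable-growth dividend discount model.

```python
def intrinsic_pe(payout_ratio: float, cost_of_equity: float, growth: float) -> float:
    """Forward P/E implied by a stable-growth dividend discount model.

    P0 / E1 = payout ratio / (r - g); valid only when r > g.
    """
    if cost_of_equity <= growth:
        raise ValueError("cost of equity must exceed the growth rate")
    return payout_ratio / (cost_of_equity - growth)

# Illustrative inputs (not company-specific): 60% payout, 8% cost of equity, 3% growth
print(intrinsic_pe(0.60, 0.08, 0.03))   # -> 12.0
```

Higher payout and growth push the implied P/E up; higher risk (cost of equity) pushes it down, which is exactly the intuition the regression tries to capture.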
Before running a regression, it’s essential to visually inspect the relationship between the dependent and independent variables using scatter plots.
An example scatter plot of P/E ratios against growth rates may reveal a positive relationship but high variability (noise) around the fitted line, indicating a potentially low R2 value when the regression is performed.
In our case:
R2 ≈ 0.40: Indicates that 40% of the variability in P/E ratios can be explained by the independent variables.
Coefficients: Indicate the effect of each independent variable on the P/E ratio.
Statistically significant independent variables are determined via t-statistics or p-values.
When analyzing multiple independent variables, ensure they are independent of each other. High correlation among independent variables (multicollinearity) can distort results.
If multicollinearity exists, consider removing one of the correlated variables or transforming the data.
To enhance the model’s explanatory power:
Seek better proxies for the independent variables.
Introduce additional independent variables based on solid logical foundations.
Address outliers, which can artificially inflate or deflate R2.
Consider weighted least squares if variance across observations is unequal.
In practice, having more data can tempt analysts into data mining, selecting variables purely because they produce a higher R2. This can lead to so-called p-hacking:
Researchers might artificially manipulate their dataset to show significant results.
Remember, statistical significance does not imply economic or practical significance. A statistically significant model may not yield profitable investment decisions due to trading costs and market frictions.
Regression analysis is a powerful tool, but it requires careful consideration when determining independent variables, analyzing correlations, accounting for multicollinearity, and ensuring that statistical significance translates into economic reality.
Probabilities are fundamental in understanding uncertain outcomes. In many situations, particularly in finance, we cannot predict events with certainty.
A probability measures the likelihood of an event occurring under uncertainty, mathematically formalized as follows:
P(A) = Probability of event A
where 0 ≤ P(A) ≤ 1:
P(A) = 0 indicates the event is impossible.
P(A) = 1 indicates the event is certain.
P(A) = 0.5 indicates the event is equally likely to happen or not (even odds).
In finance, estimating probabilities is crucial for decision-making. We often evaluate the likelihood of discrete events (e.g., bankruptcy) or continuous outcomes (e.g., future earnings).
Discrete Events: Outcomes that can be counted (e.g., bankruptcy: yes or no).
Continuous Events: Outcomes that can take on a range of values (e.g., earnings greater than 1 billion dollars).
There are two primary views of probability:
The frequentist view: Probability is based on the frequency of occurrence of events.
Repeated trials can eventually reveal the true probability.
Example: Flipping a fair coin many times yields a probability of heads converging to 0.5.
The subjective (Bayesian) view: Probability reflects personal belief or estimation rather than frequency data.
Individuals may disagree on probabilities based on different available information.
Example: Different investors might estimate different probabilities for stock price movements based on their insights and analyses.
Joint Probability: The likelihood of a series of outcomes all occurring. For independent events, this is the product of the individual probabilities:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) ⋅ P(A2) ⋅ … ⋅ P(An)
For dependent events, each term is replaced by the probability conditional on the earlier events.
Conditional Probability: The probability of an event occurring given that another event has occurred:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Probabilities are bounded: 0 ≤ P(A) ≤ 1.
The total of all possible outcomes must equal 1:
$$\sum_{i} P(A_i) = 1$$
Complement Rule:
P(A′) = 1 − P(A)
where A′ is the event that A does not occur.
General Addition Rule:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Multiplication Rule for Independent Events:
P(A ∩ B) = P(A) ⋅ P(B)
Bayes’ theorem relates conditional probabilities and shows how to update probabilities given new evidence:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
This emphasizes that probabilities are derived from prior knowledge and can be adjusted with new information.
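As a worked illustration of Bayesian updating (all numbers are hypothetical), the helper below updates a prior probability of corporate distress after observing a negative earnings surprise.

```python
def bayes_update(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(A | B) = P(B | A) * P(A) / P(B), with P(B) expanded over A and not-A."""
    p_b = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / p_b

# Hypothetical example: prior probability of distress = 10%,
# P(negative surprise | distress) = 70%, P(negative surprise | healthy) = 20%
posterior = bayes_update(prior=0.10, likelihood=0.70, false_positive_rate=0.20)
print(f"P(distress | negative surprise) = {posterior:.2%}")   # 28%
```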
Probit and logit models estimate probabilities for binary outcomes (0 or 1) based on independent variables.
Probit model: Uses the cumulative standard normal distribution to convert a linear index into a probability:
P(Y = 1|X) = Φ(Xβ)
Logit model: Employs a logistic distribution:
$$P(Y=1|X) = \frac{1}{1 + e^{-X\beta}}$$
Example: Predicting Bankruptcy
Historical data helps create independent variables such as earnings and debt levels.
Dependent variable is whether a firm went bankrupt (1) or not (0).
Example: Predicting Acquisitions
Independent variables may include stock performance metrics.
Similar modeling to predict whether a company will be acquired.
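To make the bankruptcy example concrete, here is a minimal sketch that fits a logit model on synthetic firm data with statsmodels; the variables, coefficients, and data are invented for illustration, not estimated from any real sample.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500

# Synthetic firm-level data (illustrative only)
ebit_margin = rng.normal(0.08, 0.10, n)       # profitability
debt_to_assets = rng.uniform(0.0, 0.9, n)     # leverage

# Made-up "true" relationship: weaker margins and more debt raise bankruptcy risk
logit_index = -2.0 - 8.0 * ebit_margin + 4.0 * debt_to_assets
prob_bankrupt = 1 / (1 + np.exp(-logit_index))
bankrupt = rng.binomial(1, prob_bankrupt)      # 1 = went bankrupt, 0 = survived

# Fit the logit model and inspect estimated coefficients and predicted probabilities
X = sm.add_constant(np.column_stack([ebit_margin, debt_to_assets]))
model = sm.Logit(bankrupt, X).fit(disp=False)
print(model.params)
print(model.predict(X[:5]))
```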
A useful way to visualize and assess sequential risk scenarios.
Event Nodes: Outcomes determined by chance, governed by probabilities.
Decision Nodes: Points where choices are made.
End Nodes: Final outcomes resulting from decisions and events.
To analyze a decision tree, roll back from end nodes to make decisions based on expected values.
Used to evaluate the impact of different future possible situations based on continuous outcomes.
Plausible scenarios for varying economic conditions.
Best-case/Worst-case analyses, though often criticized for not providing much useful information.
Understanding and applying probabilities is crucial in investing and finance. By utilizing tools such as Bayesian inference, Probit/Logit models, and decision trees, investors can better navigate uncertainty and make informed decisions.
In this session, we explore the application of probabilistic tools to questions in investing and finance. We will delve into how markets operate under the assumption of a random walk and how this influences investment strategies.
The market is said to follow a random walk, meaning that there is an equal probability of the market going up or down on any given day:
P(up) = P(down) = 0.5
If we observe the market over n trading days, we expect to see:
$$\text{Expected Up Days} = \frac{n}{2}, \quad \text{Expected Down Days} = \frac{n}{2}$$
For example, over 100 days, we anticipate roughly 50 up days and 50 down days.
Researchers collect data over n trading days and analyze the actual distribution of up and down days. To estimate the standard error (σ) of the proportion of up days under this null hypothesis, the formula is:
$$\sigma = \sqrt{\frac{P(\text{up}) \cdot P(\text{down})}{n}} = \sqrt{\frac{0.5 \cdot 0.5}{n}}$$
For n = 100:
$$\sigma = \sqrt{\frac{0.25}{100}} = 0.05$$
From the standard error, we can calculate confidence intervals:
95%Confidence Interval = p̂ ± 2σ
Over 1257 trading days (2016-2020):
Up days: 700 (55.69%)
Down days: 557 (44.31%)
For the proportion of up days:
P(up) = 0.5569, P(down) = 0.4431
The standard error is calculated as:
$$\sigma = \sqrt{\frac{0.5569 \cdot 0.4431}{1257}} \approx 0.0141$$
At 95% confidence,
Confidence Interval = 0.5569 ± 2(0.0141) = (0.5287, 0.5851)
Since 50% falls outside this range, we can reject the hypothesis of equal up and down days.
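This test can be scripted directly from the counts reported above; the sketch below reproduces the standard error, the confidence interval, and a z-statistic against the random-walk null.

```python
import numpy as np

up_days, down_days = 700, 557
n = up_days + down_days                 # 1257 trading days

p_hat = up_days / n                     # observed proportion of up days
se = np.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion

ci_low, ci_high = p_hat - 2 * se, p_hat + 2 * se
print(f"p_hat={p_hat:.4f}  se={se:.4f}  ~95% CI=({ci_low:.4f}, {ci_high:.4f})")

# z-statistic against the random-walk null of p = 0.5
z = (p_hat - 0.5) / np.sqrt(0.5 * 0.5 / n)
print(f"z = {z:.2f}")   # well beyond 2, so reject p = 0.5
```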
If the market was up the previous day, we analyze the likelihood for the next day:
After an up day: 367 up days, 333 down days (52.43%)
After a down day: 332 up days, 225 down days (59.6%)
The probability following an up day is not significantly different from 50%. However, there is a significantly higher chance of up days after down days (59.6% vs. 50%).
To find the probability of consecutive up days or down days:
P(three up days) = (0.5569)³ ≈ 0.173
For five up days:
P(five up days) = (0.5569)⁵ ≈ 0.0536
Funds can be classified into quartiles based on past performance, and the likelihood of performance persistence can be analyzed.
The probability of corporate bonds defaulting changes with economic conditions.
High yield bond default rate over 20 years may reach approximately 50%.
The Altman Z-score combines specific accounting ratios to classify companies based on their likelihood of bankruptcy:
$$Z = 1.2 \frac{WC}{TA} + 1.4 \frac{RE}{TA} + 3.3 \frac{EBIT}{TA} + 0.6 \frac{MV}{TL} + 1.0 \frac{S}{TA}$$
where WC is working capital, RE is retained earnings, MV is the market value of equity, TL is total liabilities, S is sales, and TA is total assets.
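Under this classic 1968 formulation, the score is easy to compute; the inputs below are illustrative, not taken from a real company.

```python
def altman_z(working_capital, retained_earnings, ebit, market_value_equity,
             sales, total_assets, total_liabilities):
    """Classic Altman (1968) Z-score for publicly traded manufacturing firms."""
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / total_assets)

# Illustrative inputs in $ millions (not a real company)
z = altman_z(working_capital=150, retained_earnings=400, ebit=120,
             market_value_equity=900, sales=1_000,
             total_assets=1_200, total_liabilities=600)
print(f"Z-score = {z:.2f}")   # rough rule of thumb: scores below ~1.8 signal distress
```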
Decision trees are useful for modeling complex decisions under uncertainty, such as in pharmaceutical approvals.
Expected value can be calculated as:
Expected Value = P(success) × Cash Flow
Scenario analysis aids in evaluating the impact of different potential future events on value, focusing on extreme and likely outcomes.
In this session, we will explore the integration of data distributions, probabilities, and Monte Carlo simulations as applied in data analysis and decision-making, particularly in finance.
Typically, in finance, we estimate independent variables which inform an output variable that we are attempting to explain.
Example: Valuing a company involves key inputs such as revenue growth and operating margins.
Suppose we estimate revenue growth at g = 12% and operating margin at m = 8%.
Point estimates, while convenient, are limiting due to inherent uncertainties in actual values.
Instead of providing point estimates, we replace each with a probability distribution.
For example, revenue growth could be modeled with a distribution centered around 12% but allowing for variance (e.g., a range between 4% and 20%).
Monte Carlo simulations allow us to:
Select a random value from each of the distributions defined for independent variables.
Run simulations multiple times to produce a range of outputs, providing a distribution of output values rather than a single estimate.
For successful simulations:
Build a sound base model connecting independent variables to the output variable.
Keep the model simple and transparent.
Focus on the most significant independent variables in terms of their effect on outputs.
Analyzing uncertainties involves distinguishing between: Discrete vs. Continuous Uncertainty:
Continuous Uncertainty: e.g., margins that can take on a range of values.
Discrete Uncertainty: e.g., regulatory changes that can either happen or not (0 or 1).
Symmetric vs. Asymmetric Risk:
Symmetric: Uncertainty exists equally on both sides.
Asymmetric: Greater likelihood of being wrong in one direction (e.g., margins could drop significantly).
It is crucial to identify the right type of distributions (e.g., normal, log-normal, triangular) to model your uncertainties.
In a valuation example performed in May 2016: Point Estimates:
Revenue Growth (g): 1.5%
Operating Margin (m): 25%
A base case valuation provided a value of $126 per share.
For Revenue Growth, a log-normal distribution was chosen due to the belief in positive surprise potential.
For Operating Margin, a triangular distribution was used, with bounds set to prevent unrealistic values.
Using Crystal Ball (an Excel add-in):
Conducted 100,000 simulations.
Resulting output showed distributions of value per share.
Median Value: Approximately $123 per share.
Value Range:
10th percentile: $99
90th percentile: $157
Indicates a high probability (93%) that Apple is undervalued compared to its market trading price of $93.
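The same workflow can be sketched in plain Python with numpy instead of a spreadsheet add-in. The input distributions and the toy valuation formula below are assumptions made for illustration; they are not the actual model behind the numbers above, which would require rerunning a full DCF for each draw.

```python
import numpy as np

rng = np.random.default_rng(2016)
n_sims = 100_000

# Illustrative input distributions (parameters are assumptions):
# revenue growth ~ lognormal centered near 1.5%, margin ~ triangular around 25%
revenue_growth = rng.lognormal(mean=np.log(0.015), sigma=0.5, size=n_sims)
operating_margin = rng.triangular(left=0.18, mode=0.25, right=0.30, size=n_sims)

# Toy valuation link from inputs to value per share (placeholder functional
# form standing in for the full DCF model)
base_value = 126.0
value_per_share = (base_value
                   * (1 + 5 * (revenue_growth - 0.015))
                   * (operating_margin / 0.25))

print(f"median value:    {np.median(value_per_share):.2f}")
print(f"10th percentile: {np.percentile(value_per_share, 10):.2f}")
print(f"90th percentile: {np.percentile(value_per_share, 90):.2f}")
print(f"P(value > 93):   {np.mean(value_per_share > 93):.1%}")
```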
Monte Carlo simulations provide not just a point estimate but a fuller picture of uncertainty, allowing analysts to make more informed decisions. By looking at the distribution of potential outcomes, analysts can more effectively address and understand the risks involved.
Establish a clear model linking independent variables and outputs.
Use distributions to represent uncertainties accurately.
Employ simulations to analyze and understand variability in outcomes.