
Statistics 101

Introduction

Welcome to the course on Statistics for Finance and Investment. This course is designed primarily for those with a focus on investing and finance. To succeed in the course, students are expected to bring three prerequisites:

  1. Common sense and curiosity.

  2. A basic understanding of accounting.

  3. A fundamental grasp of statistics.

Importance of Statistics

Statistics plays an essential role in our everyday decision-making, especially in finance and healthcare. Understanding statistics is crucial because it allows us to interpret data accurately.

Course Objectives

The course aims to enhance your statistical knowledge for practical applications in finance. By the end of the 15 sessions, participants should be able to:

  1. Understand various statistical tools and when to use them.

  2. Recognize misapplication of statistics and ask informed questions.

  3. Apply statistical reasoning to financial decision-making.

The Role of Statistics in Investment

The application of statistics in investment allows for better decision-making, exemplified by the concept of Moneyball. This term refers to using statistical analysis to gain competitive advantages, particularly in contexts normally driven by intuition.

“Statistics allows for better decisions than gut feelings or rules of thumb.”

Conclusion

By participating in this course, you will gain a robust understanding of statistical tools that apply directly to finance and investment contexts. Engaging with these concepts will enable students to navigate data-driven environments more effectively, thus improving their decision-making capabilities.

Introduction to Statistics

Introduction

Welcome to the first session of a 15-session statistics class. This session will lay the groundwork for the course, discussing essential concepts in statistics, its significance, and the various components that compose the field.

What is Statistics?

Statistics is often referred to as the science of data. It is a discipline for collecting, summarizing, and drawing inferences from data.

It is crucial to understand and interpret data effectively, especially in an age where we are surrounded by vast amounts of information that can sometimes be contradictory or misleading.

Definitions of Data

Types of Data

Data can be defined in several ways.

Importance of Data

Historically, data has been vital for decision-making. However, with modern advancements, we face challenges such as data overload, where the excessive amount of data can lead to confusion rather than clarity.

Components of Statistics

Data Collection and Sampling

Statistics encompasses various activities, including data collection, sampling, summarization, visualization, and inference.

It is essential to ensure that sampling is unbiased and correctly executed to avoid erroneous conclusions. Two critical concepts to understand in this phase are the population and the sample.

Descriptive Statistics

After collecting data, the next step is summarizing it using descriptive statistics, which include measures of centrality, dispersion, symmetry, and extremes.

Data Visualization

Visual representation of data, such as histograms, enables better understanding and insights. A histogram counts the number of observations within specified ranges, and the shape of the resulting distribution can provide crucial insights.

Correlation and Causation

Exploring relationships between variables involves distinguishing correlation (variables moving together) from causation (one variable driving the other).

Probability in Statistics

Probability measures the likelihood of events.

Tools such as probits and logit models can help estimate probabilities based on observable data.

Conclusions

The understanding of statistics is essential for informed decision-making in various fields, including finance, healthcare, and policy-making.

The remaining sessions will build on these foundational elements and enhance your understanding of statistics and its applications in the real world.

Populations and Samples

Introduction

In this session of the statistics class, we will discuss the concept of populations and samples, with a particular focus on their relevance to data analysis.

Definitions

Population

A population refers to the entire universe of instances of an object or phenomenon we intend to study. For example, if we want to analyze how businesses behave in crises, the population includes every business around the world—large or small, public or private, regardless of geographic location.

Sample

A sample is a subset of the population. Due to practical limitations, we often rely on samples to draw conclusions about the population. For instance, during the COVID-19 pandemic, one might analyze a sample of publicly traded companies rather than all businesses.

Sampling Methods

There are two main approaches to sampling: time series samples and cross-sectional samples.

Time Series Sample

This method involves collecting data over time. For instance, analyzing the annual stock returns from the U.S. stock market since 1871 would amount to a time series sample. If only data from a specific period (e.g., 2000 to 2020) is used, it still qualifies as a sample.

Cross-Sectional Sample

This sampling involves looking at a snapshot from a population at a specific point in time. For example, if we focus on publicly traded companies with a market cap exceeding $10 million, we obtain a more reliable sample.

Reasons for Sampling

Sampling is often necessary due to practical constraints, such as the cost, time, and effort of observing the entire population.

Types of Sampling

Sampling can be divided into probability-based and non-probability-based methods.

Probability-Based Sampling

In this method, samples are chosen at random. For example, selecting 500 companies at random from 9,000 publicly traded companies.

Non-Probability-Based Sampling

Here, the researcher uses specific criteria to select samples, such as picking the 500 largest market cap companies (similar to the S&P 500).

Variants of Random Sampling

Simple Random Sample

Every observation has an equal chance of being selected. However, this might lead to an unbalanced representation of different sectors.

Stratified Random Sample

The population is divided into strata (groups), and samples are taken from each stratum. This method ensures that various segments are adequately represented.

Cluster Random Sampling

Population is divided into clusters based on a characteristic (e.g., alphabetical order of companies), and entire clusters are sampled.
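
To make the first two variants concrete, here is a minimal Python sketch (the population, sector labels, and sample size are made up for illustration) contrasting a simple random sample with a stratified one:

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical population: 9,000 companies, unevenly split by sector.
population = ([("tech", i) for i in range(4_000)] +
              [("energy", i) for i in range(3_000)] +
              [("utilities", i) for i in range(2_000)])

# Simple random sample: every company has an equal chance, but the
# sector mix of any one draw can drift from the population mix.
simple = random.sample(population, 500)
print("simple:    ", Counter(sector for sector, _ in simple))

# Stratified sample: draw from each sector in proportion to its size,
# guaranteeing that each stratum is represented.
stratified = []
for sector in ("tech", "energy", "utilities"):
    stratum = [c for c in population if c[0] == sector]
    k = round(500 * len(stratum) / len(population))
    stratified.extend(random.sample(stratum, k))
print("stratified:", Counter(sector for sector, _ in stratified))
```

The stratified draw fixes each sector's share in advance, which is exactly what protects against the unbalanced representation noted above.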

Challenges and Bias in Sampling

Bias in sampling can originate from the way observations are selected; selection criteria that systematically favor certain outcomes lead to erroneous conclusions.

Sampling Errors

Estimation from samples includes sampling noise or error, which is a natural part of sampling due to variability. For example, when tossing a fair coin, the outcomes over a finite number of tosses will not perfectly reflect the true 50/50 odds.

Key Statistical Concepts

Independent and Identically Distributed (IID) Observations

Observations should ideally be independent (current event unaffected by previous events) and identically distributed (drawn from the same probability distribution).

Law of Large Numbers

The law states that as sample size increases, the sample average approaches the population average:


$$\lim_{n \to \infty} \bar{X}_n = \mu$$

where $\bar{X}_n$ is the sample mean and μ is the population mean.
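
A quick Python simulation (illustrative only; the seed and sample sizes are arbitrary) shows the sample mean of fair coin tosses settling toward the true mean of 0.5 as n grows:

```python
import random

random.seed(42)

# Simulate fair coin tosses (1 = heads) and watch the running
# sample mean approach the true mean of 0.5 as n grows.
for n in [10, 100, 1_000, 10_000, 100_000]:
    tosses = [random.randint(0, 1) for _ in range(n)]
    sample_mean = sum(tosses) / n
    print(f"n = {n:>7}: sample mean = {sample_mean:.4f}")
```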

Central Limit Theorem (CLT)

Regardless of population distribution, the sampling distribution of the sample mean tends toward a normal distribution as sample size increases:


$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

where σ is the population standard deviation and n is the sample size; σ/√n is the standard error of the sample mean.
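
A short simulation (illustrative; the exponential population with μ = 1 and σ = 1 is an arbitrary non-normal choice) shows the sample means clustering with spread close to σ/√n:

```python
import random
import statistics

random.seed(0)

# Draw many samples from a decidedly non-normal (exponential)
# population and collect the sample means: their distribution
# should look approximately normal with spread sigma / sqrt(n).
n, trials = 50, 5_000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

print("mean of sample means:", round(statistics.fmean(means), 3))  # ~ mu = 1
print("sd of sample means:  ", round(statistics.stdev(means), 3))  # ~ 1/sqrt(50) = 0.141
```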

Chebyshev’s Inequality

This inequality allows for statements about distributions that are not normal, asserting that:


$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

for any k > 1, indicating that the fraction of observations that lie within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$.
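
A numerical check (illustrative; again using an exponential sample as an arbitrary non-normal population) confirms that the observed tail fractions stay under the 1/k² bound:

```python
import random

random.seed(1)

# Check Chebyshev's bound on a non-normal sample: the fraction of
# observations beyond k standard deviations must not exceed 1/k^2.
data = [random.expovariate(1.0) for _ in range(100_000)]
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5

for k in (2, 3, 4):
    frac = sum(abs(x - mu) >= k * sigma for x in data) / len(data)
    print(f"k={k}: observed {frac:.4f} <= bound {1 / k**2:.4f}")
```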

Conclusion

In statistics, sampling is a fundamental concept that allows researchers to make inferences about populations. The choice between sampling methods, awareness of biases, and understanding of relevant statistical laws such as the Law of Large Numbers and the Central Limit Theorem are crucial in conducting reliable and valid analysis.

Sampling Issues in Finance and Investing

Introduction

This session focuses on sampling questions and issues frequently encountered in finance and investing. A primary aspect is understanding market indices, which serve as a sample of publicly traded stocks. It is essential to note that these samples are not random; they are selected based on specific criteria.

Market Indices as Samples

Definition and Purpose

Market indices represent a sample of stocks that help gauge overall market performance. Common indices include the Dow Jones Industrial Average (DJIA), the S&P 500, and the NASDAQ.

Selection Criteria

Indices are chosen based on criteria such as market capitalization. For instance, the S&P 500 comprises roughly the 500 largest publicly traded U.S. companies by market capitalization.

Weighting Schemes

How stocks are included and how they are weighted (for example, price-weighted versus market-cap-weighted) can significantly affect how the index represents market performance.

Impact of Index Divergence

The diversity of indices can lead to different performance trends. For example, in 2020, the performance divergence among the DJIA, S&P 500, and NASDAQ was evident due to sector concentration, particularly in technology stocks.

Sampling in Investment Vehicles

Investment Strategies

Passive investing has become increasingly prevalent through index funds and ETFs.

Specific Index Sampling

Many index funds and ETFs claim to track indices but often have distinct compositions due to sampling methods.

Time Sampling Issues

Historical Data Limitations

Investors often focus on specific time periods, ignoring earlier data due to reliability concerns. Relying only on such windows (e.g., post-1926 data for the U.S. stock market) may distort broader market understanding.

PE Ratios and Normalized Earnings

The Price-to-Earnings (PE) ratio is a key metric used to classify stocks:
$$PE = \frac{\text{Price}}{\text{Earnings}}$$

Caution in Extrapolating Findings

When using historical averages for future predictions, be cognizant of the shifting market structures that may render past data less relevant—historical averages should not be seen as definitive future indicators.

The Small Cap Premium

The small cap premium is the tendency of small companies to outperform larger ones. Historical evidence through 1980 supported the premium; however, current trends need careful analysis, as past performance may not predict future results.

Key Statistics

Historically, the premium was documented in long-run U.S. return data through 1980.

Empirical Irregularities and Testing Investment Strategies

Many investment strategies are back-tested against historical data, and several common questions help assess whether the back-tested results are reliable.

Event Studies in Finance

Event studies analyze stock performance surrounding specific events (e.g., mergers); the same sampling cautions apply to how events and time windows are selected.

Conclusion

Sampling is an integral part of finance and investing. Understanding the nuances of how samples are constructed and their limitations is crucial for making informed investment decisions. Careful consideration of indices, investment vehicles, historical data, and empirical findings will lead to more robust investment strategies.

Summary Statistics

Introduction

Summary statistics provide a compact representation of data, enabling better comprehension and communication of numerical information. This session covers four primary categories of descriptive statistics:

  1. Measures of Centrality

  2. Measures of Dispersion

  3. Measures of Symmetry

  4. Measures of Extremes

Measures of Centrality

These measures describe the central value of a data set.

1. Average (Mean)

The average, or mean, is calculated as follows:
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where xi is each observation and n is the total number of observations.

2. Weighted Average

A weighted average adjusts the significance of certain values:
$$\text{Weighted Average} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
where wi is the weight assigned to each observation.

3. Median

The median is the middle value when data is sorted. For an odd number of observations (n):
$$\text{Median} = x_{\frac{n+1}{2}}$$
For an even number of observations:
$$\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}$$

4. Mode

The mode is the most frequently occurring value in the dataset.

Measures of Dispersion

These measures indicate the spread of data around the central value.

1. Range

The range is simply the difference between the maximum and minimum values:
Range = Max(X) − Min(X)

2. Interquartile Range (IQR)

The IQR is the difference between the 75th percentile (Q3) and 25th percentile (Q1):
IQR = Q3 − Q1

3. Variance

Variance quantifies the degree of dispersion around the mean. For a population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
For a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

4. Standard Deviation

The standard deviation is the square root of variance:
$$\sigma = \sqrt{\sigma^2} \quad \text{(population)}$$

$$s = \sqrt{s^2} \quad \text{(sample)}$$

5. Coefficient of Variation (CV)

The CV is a standardized measure of dispersion, calculated as:
$$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}}$$

Measures of Symmetry

Measures of symmetry assess the skewness of a data distribution.

1. Skewness

Skewness quantifies the asymmetry of the distribution:
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n \cdot s^3}$$
A value near zero indicates symmetry; positive values signal a long right tail, and negative values a long left tail.

Measures of Extremes

These measures relate to the occurrence of outliers or extreme values.

1. Kurtosis

Kurtosis measures the "tailedness" of the distribution:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n \cdot s^4} - 3$$
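
The measures in this section can be computed directly from their definitions. A compact Python sketch (the return series is made up) covering centrality, dispersion, skewness, and excess kurtosis:

```python
import statistics

# Toy sample of annual returns (percent); values are made up.
x = [7.5, 12.0, -3.2, 9.8, 15.1, 4.4, -8.0, 11.3, 6.7, 2.9]
n = len(x)
mean = statistics.fmean(x)
s = statistics.stdev(x)                      # sample std dev (n - 1)

q1, q2, q3 = statistics.quantiles(x, n=4)    # quartiles
skew = sum((xi - mean) ** 3 for xi in x) / (n * s ** 3)
kurt = sum((xi - mean) ** 4 for xi in x) / (n * s ** 4) - 3  # excess

print(f"mean={mean:.2f}  median={statistics.median(x):.2f}")
print(f"range={max(x) - min(x):.2f}  IQR={q3 - q1:.2f}  s={s:.2f}")
print(f"CV={s / mean:.2f}  skew={skew:.2f}  excess kurtosis={kurt:.2f}")
```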

Conclusion

In summary, understanding these measures enables better statistical analysis and data visualization. Key measures include centrality (mean, median, mode), dispersion (range, IQR, variance, standard deviation), symmetry (skewness), and extremes (kurtosis).

Descriptive Statistics in Financial Data

Introduction

In this session, we will discuss the application of descriptive statistics to real-world financial data. We will analyze returns on three asset classes:

Calculating Returns

Treasury Bills

For Treasury Bills, the return is simply the rate at the end of the year, as there is no price change.

Treasury Bonds

The total return on Treasury Bonds includes two components: the coupon payment and the price change.

Total return can be expressed as:
Total Return = Coupon Payment + Price Change

Stock Returns

For stocks, returns are calculated using dividends received and the change in price.

The total stock return is:
Total Stock Return = Dividends + Price Change

Descriptive Statistics: Average Returns

Average returns are calculated over the period from 1928 to 2020 for each of the three asset classes.

Risk Assessment: Standard Deviation

To assess risk, we calculate the standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}$$
Where μ is the average return, xi is each annual return, and n = 93 (the annual observations from 1928 to 2020).

For stocks:
$$\sigma_{stocks} = 19.49\%$$

For Treasury Bonds and Bills, the same formula can be applied, showing their lower volatility compared to stocks.

Standard Error

To calculate the standard error:
$$SE = \frac{\sigma}{\sqrt{n}}$$
For stocks:
$$SE_{stocks} = \frac{19.49}{\sqrt{93}} \approx 2.02\%$$

Confidence intervals can then be constructed around the average return: a 95% interval is approximately the sample mean ± 2 standard errors.
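
A minimal sketch reproducing the standard-error arithmetic above (only σ = 19.49% and n = 93 are taken from the session; no average return is assumed):

```python
import math

# Standard error of the average annual stock return:
# sigma = 19.49% over n = 93 annual observations (1928-2020).
sigma, n = 19.49, 93
se = sigma / math.sqrt(n)
print(f"SE = {se:.2f}%")                       # ~2.02%
print(f"95% CI = sample mean +/- {2 * se:.2f}%")
```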

Median and Distribution Characteristics

Kurtosis

Further Analysis: PE Ratios and Cost of Capital

PE Ratios

Cost of Capital

Conclusion

The analysis of descriptive statistics provides insight into the performance and risk levels of different asset classes. Stocks, despite high returns, exhibit greater volatility, while Treasury securities show stability.

Data Distributions

Introduction

In this session, we transition from data descriptives to understanding distributions. Data descriptives focus on summary statistics, while distributions offer a visual and analytical framework to represent data characteristics.

Data Descriptives

Data descriptives include measures of centrality, dispersion, symmetry, and extremes.

While numerical summaries are informative, visual representations like histograms and bar charts are often more compelling and easier to interpret.

Visual Displays

Bar Charts vs. Histograms

Examples

For instance, we can analyze the Price-to-Earnings (P/E) ratios of a cross-section of companies.

Identifying the Right Distribution

Once visualized, identify which statistical distribution best fits the data. Key characteristics include whether the data is continuous or discrete, symmetric or skewed, and thin-tailed or fat-tailed.

Normal Distribution

The normal distribution is defined as:
X ∼ N(μ, σ²)
where μ is the mean and σ is the standard deviation. Properties include symmetry around the mean, with about 68% of observations falling within one standard deviation of the mean, 95% within two, and 99.7% within three.

Criterion for normality: Check the percentage of observations outside three standard deviations.
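
That criterion is easy to implement. A Python sketch (both samples are synthetic; the mixture behind the fat-tailed series is an arbitrary choice) comparing the fraction beyond three standard deviations, which should be near 0.27% under normality:

```python
import random

random.seed(3)

# Rough normality screen: under a normal distribution only about
# 0.27% of observations fall more than 3 standard deviations from
# the mean; a much larger fraction suggests fat tails.
def frac_beyond_3sd(data):
    n = len(data)
    mu = sum(data) / n
    sd = (sum((x - mu) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum(abs(x - mu) > 3 * sd for x in data) / n

normal_like = [random.gauss(0, 1) for _ in range(50_000)]
fat_tailed = [random.gauss(0, 1) * random.choice([1, 1, 1, 5])
              for _ in range(50_000)]
print(f"normal sample: {frac_beyond_3sd(normal_like):.4%}")
print(f"fat-tailed:    {frac_beyond_3sd(fat_tailed):.4%}")
```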

Alternative Distributions

When considering small sample sizes or specific data characteristics, alternative distributions may be more appropriate.

T-Distribution

Similar to the normal distribution but with fatter tails; as the degrees of freedom increase, the t-distribution converges to the normal.

Triangular Distribution

A simple bounded distribution characterized by a pronounced peak at the mode and finite upper and lower bounds.

Uniform Distribution

A type of bounded distribution where all outcomes are equally likely between two defined bounds.

Skewed Distributions

If the data is not symmetric, choose a distribution that matches the direction of the skew:

Negative Skew

Use the Minimum Extreme Value Distribution for distributions with a long tail on the negative side.

Positive Skew

Use the Log-Normal Distribution for data such as stock prices, where extreme values can be significantly positive.

Cumulants and Kurtosis

The kurtosis of a distribution provides insights into tail behavior: high excess kurtosis indicates fat tails, meaning extreme values occur more often than under a normal distribution.

Choosing the Correct Distribution

Follow this flowchart for selecting appropriate distributions:

  1. Is the data continuous or discrete?

  2. Is the data symmetric or asymmetric?

  3. If symmetric, what are the tail characteristics (thin/fat)?

  4. If asymmetric, which direction is the skew?

Common distributions include the normal, t, uniform, triangular, log-normal, and minimum extreme value distributions.

Conclusion

Understanding the characteristics of your data and the appropriate distribution models is crucial for effective data analysis. This session covered a range of distributions and guidance for selection, which will aid in the interpretation of statistical findings.

Data Distributions in Finance

Introduction

This document provides an overview of the importance of data distributions in finance, particularly focusing on the normal distribution. While the normal distribution simplifies analysis and modeling, it can lead to significant misjudgments if the underlying data does not conform to this distribution.

Normal Distribution in Finance

Reasons for Dependency on Normal Distribution

The normal distribution has the probability density function (PDF):


$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where μ is the mean and σ is the standard deviation.

Implications of Assuming Normality

Assuming all data is normally distributed can lead to significant errors, especially in finance where extreme outcomes (fat tails) can occur more frequently than predicted by the normal distribution.

Analyzing Stock Returns

Case Study: Disney

Annual returns on Disney from 1962 to 2021 were analyzed. A histogram of returns illustrates the distribution visually.

Histogram

The histogram shows annual returns ranging from about -20% to over +120%, indicating some extreme yearly returns.

QQ Plot

A Quantile-Quantile (QQ) plot compares the quantiles of the empirical data to the quantiles of a normal distribution. Ideally, points should lie on the reference line if the data is normally distributed.
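
A QQ plot can be produced with standard scientific Python tools, assuming numpy, scipy, and matplotlib are available (the return series below is a synthetic placeholder, not Disney's actual data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
returns = rng.normal(0.10, 0.25, size=60)   # placeholder annual returns

# QQ plot: empirical quantiles against theoretical normal quantiles.
# Points hugging the reference line are consistent with normality.
stats.probplot(returns, dist="norm", plot=plt)
plt.title("QQ plot of annual returns vs. normal")
plt.show()
```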

Statistical Tests for Normality

Conduct various statistical tests for normality, such as the Jarque-Bera, Shapiro-Wilk, Kolmogorov-Smirnov, and Anderson-Darling tests.

For Disney, 7 out of 9 tests did not reject the normality hypothesis.

Case Study: Apple

Annual returns on Apple from 1981 to 2020 were examined, showing a broader distribution with extreme values.

Variability of Returns

Long-Term vs. Short-Term Returns

Consequences of Misestimation of Risk

If investors are misled by assuming normality, risk management systems may fail to protect against extreme outcomes.

Log Prices

Transition to Log Prices

Log prices can be more normally distributed due to the transformation:


Log Price = ln(Price)

However, a logarithmic transformation does not guarantee normality. For example, even after taking logs, Apple’s stock prices still showed deviations from normality.

Summary and Best Practices

Conclusion

Understanding the underlying data distributions in finance is vital for accurate risk assessment and management. Further exploration into empirical distributions and non-normality in financial datasets is encouraged.

Analyzing Relationships Between Two Data Variables

Introduction

In previous sessions, we explored single data variables and their measures of centrality, dispersion, and skewness. In this session, we shift our focus to analyzing relationships between two data variables. This can be useful in evaluating whether variables such as price-earnings ratios and interest rates, or earnings growth and GDP growth, exhibit any linkage.

Linkage Between Variables

When examining two data series, we seek to determine whether they move together, move in opposite directions, or show no relationship at all.

If a relationship exists, we can further investigate potential lead or lag effects, in which one variable influences the timing of another.

Correlation and Causation

It is critical to differentiate between correlation and causation: correlation measures how two variables move together, whereas causation means that changes in one variable bring about changes in the other.

It’s important to note that correlation does not imply causation; variables may appear correlated due to random chance or third-party influences.

Scatter Plots

To visually assess the relationship between two variables, we create a scatter plot, placing one variable on each axis.

Scatter plots allow for a visual assessment, revealing potential correlations between the two variables.

Correlation Coefficient

To quantify the strength of the relationship, we compute the correlation coefficient r. The most widely used is the Pearson correlation coefficient, calculated as:


$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y observations.

The correlation coefficient r ranges from -1 (a perfect negative linear relationship) through 0 (no linear relationship) to +1 (a perfect positive linear relationship).
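
The coefficient can be computed directly from the definition above. A minimal Python sketch (the GDP and earnings-growth figures are made up):

```python
import math

# Pearson r computed directly from the definition above.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Toy data: GDP growth vs. earnings growth (made-up numbers).
gdp = [2.1, 2.9, 1.6, 3.0, 2.4, 0.8]
earnings = [4.0, 6.5, 2.2, 7.1, 5.0, 1.0]
print(f"r = {pearson_r(gdp, earnings):.3f}")
```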

Covariance

Another measure of the relationship between two variables is covariance, defined as:


$$\text{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$$

Unlike the correlation coefficient, covariance is not standardized, meaning its values can vary widely based on the scale of the variables. Thus, it is harder to interpret directly.

Regression Analysis

To analyze the relationship further, we can fit a regression line to the scatter plot. The most common approach is Ordinary Least Squares (OLS) regression, which minimizes the sum of the squared vertical distances (residuals) between the observed values and the regression line:

y = β0 + β1x + ϵ

The slope β1 indicates the change in y per one unit change in x, and the intercept β0 is the predicted value of y when x is zero.

R-Squared

An important output from regression analysis is R2, which measures how well the independent variable explains the variation in the dependent variable. It quantifies the goodness of fit of the model:


$$R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}$$

where $\text{SS}_{res}$ is the sum of squared residuals and $\text{SS}_{tot}$ is the total sum of squares of the dependent variable.

An R2 of 0 indicates that the model does not explain any variability, while 1 indicates perfect explanatory power.
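
Both the fitted line and R2 follow directly from these formulas. A self-contained Python sketch on toy data:

```python
# Simple OLS fit y = b0 + b1*x by the closed-form formulas,
# plus R^2 from the residual and total sums of squares.
def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]          # roughly y = 2x, with noise
b0, b1, r2 = ols(x, y)
print(f"intercept={b0:.2f} slope={b1:.2f} R^2={r2:.3f}")
```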

Multiple Regression

When expanding to more than two variables, we can conduct multiple regressions. The model extends to:


y = β0 + β1x1 + β2x2 + … + βkxk + ϵ

where each xi is an independent variable.

Issues with Multiple Regression

  1. Multicollinearity: Occurs when independent variables are correlated with each other, complicating coefficient interpretation. The Variance Inflation Factor (VIF) can be used to measure multicollinearity.

  2. Homoscedasticity: The residuals should display constant variance across all levels of the independent variables. Patterns in the residuals may indicate a problem with the model.

  3. Normality of Residuals: The residuals should be normally distributed for OLS to yield statistically valid inference.

Non-linear Relationships

If the relationship between variables is non-linear, non-linear regression techniques can be employed, or you can transform variables to achieve linearity. Transformations may include taking logarithms or polynomial terms.

Conclusion

The analysis of relationships between two data variables involves a systematic approach of visual exploration (scatter plots), quantifying relationships (correlation and covariance), fitting models (regression analysis), and checking the underlying assumptions. These methods assess predictability, essential for sound decision-making in finance and investing.

When applying these methods, remember that correlation does not imply causation and that the regression assumptions above must be checked before trusting the output.

Statistical Relationships in Finance

Introduction

In finance, it is essential to understand relationships between data, as these relationships can apply to various macro and micro-level variables.

Understanding Relationships

Reasons to Analyze Relationships

There are two main motivations for analyzing data relationships: understanding how variables move together, and using one variable to predict or explain another.

Capital Asset Pricing Model (CAPM)

One key model in finance is the Capital Asset Pricing Model (CAPM), which assesses the risk of an investment relative to a market portfolio. The idea is that the risk of an asset should be evaluated in the context of a diversified portfolio.

Risk and Regression

To measure this risk, a regression analysis can be employed:
Ri = α + βRm + ϵ
Where Ri is the return on the investment, Rm is the return on the market portfolio, β measures the investment’s exposure to market risk, α is the intercept, and ϵ is the error term.
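
Beta can be estimated as the OLS slope of this regression, which equals Cov(Ri, Rm)/Var(Rm). A minimal sketch (the monthly return series are made up):

```python
# Estimating beta as Cov(Ri, Rm) / Var(Rm), which equals the OLS
# slope of the CAPM regression above. Monthly returns are made up.
stock = [0.04, -0.02, 0.07, 0.01, -0.05, 0.06]
market = [0.03, -0.01, 0.05, 0.02, -0.04, 0.04]

n = len(stock)
mi, mm = sum(stock) / n, sum(market) / n
cov = sum((s - mi) * (m - mm) for s, m in zip(stock, market)) / (n - 1)
var_m = sum((m - mm) ** 2 for m in market) / (n - 1)

beta = cov / var_m
alpha = mi - beta * mm
print(f"beta = {beta:.2f}, alpha = {alpha:.4f}")
```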

Correlation and Covariance

Before running regression, it is useful to understand covariance and correlation between variables.

Regression Output Analysis

When analyzing the regression output, examine the coefficients, their statistical significance, and the overall fit of the model.

Regression Coefficients

The coefficients from the regression output measure how much the dependent variable changes for a one-unit change in each independent variable.

Examples

Example: Earnings Yield

Running a regression of earnings yield against short and long-term treasury rates can reveal relationships.

Example: Price to Book Ratio

Analyzing a group of banks based on their price-to-book ratio, return on equity, and risk can help in identifying undervalued banks.

Regression Diagnostics

Assessing the residuals is vital to ensure that the assumptions of regression analysis (normality, homoscedasticity) hold true.

Final Thoughts

Regression Analysis

Introduction

In this session, we will delve into regression analysis, a fundamental statistical method used to explain relationships between variables, from choosing independent variables to interpreting the output.

Identifying Variables

To choose independent variables for regression analysis, there are two main approaches:

  1. Statistical Approach: Collect data and identify which independent variables correlate the most with the dependent variable.

  2. Common Sense and Economic Theory: Use economic models and intuition to select independent variables that logically influence the dependent variable.

Example: Explaining P/E Ratios

To illustrate, consider the P/E ratios across companies. The basic framework for the P/E ratio can be derived from the Gordon Growth Model:
$$P_0 = \frac{D_1}{r - g}$$
Where P0 is the price today, D1 is the expected dividend next year, r is the cost of equity, and g is the expected growth rate.

Dividing by earnings per share (EPS), we derive:
$$P/E = \frac{(D/E)}{(r - g)}$$
Where D/E is the dividend payout ratio.

Thus, the relationship can be summarized as:
P/E = f(Payout Ratio, Growth Rate, Cost of Equity)
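
This relationship is easy to evaluate numerically. A one-function sketch (the payout ratio, cost of equity, and growth rate below are illustrative, not from the session):

```python
# Intrinsic P/E implied by the Gordon Growth inputs above.
def intrinsic_pe(payout_ratio: float, cost_of_equity: float,
                 growth_rate: float) -> float:
    return payout_ratio / (cost_of_equity - growth_rate)

# Illustrative firm: 60% payout, 9% cost of equity, 4% growth.
print(f"P/E = {intrinsic_pe(0.60, 0.09, 0.04):.1f}")   # 12.0
```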

Running a Regression

Before running a regression, it’s essential to visually inspect the relationship between the dependent and independent variables using scatter plots.

An example scatter plot of P/E ratios against growth rates may reveal a positive relationship but high variability (noise) around the fitted line, indicating a potentially low R2 value when the regression is performed.
Addressing Multicollinearity

When analyzing multiple independent variables, ensure they are independent of each other. High correlation among independent variables (multicollinearity) can distort results.

Improving R2

To enhance the model’s explanatory power, consider adding other relevant independent variables or transforming existing ones.

Caution Against Data Mining

In practice, having more data can lead to data mining, i.e., selecting variables purely because they raise R2. This can lead to so-called p-hacking: trying many model specifications and reporting only those that appear statistically significant.

Statistical vs. Economic Significance

Remember, statistical significance does not imply economic or practical significance. A statistically significant model may not yield profitable investment decisions due to trading costs and market frictions.

Conclusion

Regression analysis is a powerful tool, but it requires careful consideration when determining independent variables, analyzing correlations, accounting for multicollinearity, and ensuring that statistical significance translates into economic reality.

Probability in Finance

Introduction to Probability

Probabilities are fundamental in understanding uncertain outcomes. In many situations, particularly in finance, we cannot predict events with certainty.

Definition

A probability measures the likelihood of an event occurring under uncertainty, mathematically formalized as follows:
P(A) = Probability of event A
where 0 ≤ P(A) ≤ 1: a probability of 0 means the event cannot occur, and a probability of 1 means it is certain to occur.

Importance of Probability in Finance

In finance, estimating probabilities is crucial for decision-making. We often evaluate the likelihood of discrete events (e.g., bankruptcy) or continuous outcomes (e.g., future earnings).

Types of Events

Views of Probability

There are two primary views of probability:

Frequentist View

Probability is defined as the long-run frequency of an event over many repeated trials.

Subjectivist View (Bayesian View)

Probability is a degree of belief about an event, which is updated as new information arrives.

Types of Probabilities

Cumulative Probability

The likelihood that a series of outcomes all occur. For independent events, this is the product of the individual probabilities:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) × P(A2) × … × P(An)

Conditional Probability

The probability of an event occurring given another event:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Basic Rules of Probability

Bayes’ Theorem

Bayes’ theorem relates conditional probabilities and shows how to update probabilities given new evidence:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

This emphasizes that probabilities are derived from prior knowledge and can be adjusted with new information.
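
A worked example of the updating (all probabilities below are hypothetical, chosen only to illustrate the mechanics):

```python
# Bayes' theorem on a hypothetical screen for distressed firms:
#   P(A)    - prior probability a firm defaults
#   P(B|A)  - probability the screen flags a defaulting firm
#   P(B|~A) - false-positive rate on healthy firms
p_a, p_b_given_a, p_b_given_not_a = 0.02, 0.90, 0.10

# Total probability of being flagged, then the Bayesian update.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(default | flagged) = {p_a_given_b:.3f}")   # ~0.155
```

Even with an accurate screen, the posterior probability stays modest because the prior (2%) is so low.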

Probit and Logit Models

These models estimate probabilities for binary outcomes (0 or 1) based on independent variables.

Probit Models

Uses the cumulative standard normal distribution Φ to convert the linear index Xβ into a probability:
P(Y = 1|X) = Φ(Xβ)

Logit Models

Employs a logistic distribution:
$$P(Y=1|X) = \frac{1}{1 + e^{-X\beta}}$$
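
Both links can be evaluated with the Python standard library alone. A sketch comparing the two transformations for a few values of the linear index Xβ:

```python
import math
from statistics import NormalDist

# Convert a linear index X*beta into a probability under each model.
def probit(xb):
    return NormalDist().cdf(xb)          # Phi(X*beta)

def logit(xb):
    return 1.0 / (1.0 + math.exp(-xb))   # logistic CDF

for xb in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"Xb={xb:+.1f}  probit={probit(xb):.3f}  logit={logit(xb):.3f}")
```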

Example: Predicting Bankruptcy

Example: Predicting Acquisitions

Decision Trees

A useful way to visualize and assess sequential risk scenarios.

Components of Decision Trees

To analyze a decision tree, roll back from end nodes to make decisions based on expected values.

Scenario Analysis

Scenario analysis is used to evaluate the impact of different possible future situations based on continuous outcomes.

Scenario Types

Conclusion

Understanding and applying probabilities is crucial in investing and finance. By utilizing tools such as Bayesian inference, Probit/Logit models, and decision trees, investors can better navigate uncertainty and make informed decisions.

Probabilistic Tools in Investing and Finance

Introduction

In this session, we explore the application of probabilistic tools to questions in investing and finance. We will delve into how markets operate under the assumption of a random walk and how this influences investment strategies.

Random Walk Hypothesis

Definition

The market is said to follow a random walk, meaning that there is an equal probability of the market going up or down on any given day:
P(up) = P(down) = 0.5

Expected Outcomes Over Time

If we observe the market over n trading days, we expect to see:
$$\text{Expected Up Days} = \frac{n}{2}, \quad \text{Expected Down Days} = \frac{n}{2}$$
For example, over 100 days, we anticipate roughly 50 up days and 50 down days.

Testing the Random Walk Hypothesis

Researchers collect data over n trading days and analyze the actual distribution of up and down days. To estimate the standard error (σ) of the observed proportion, the formula is:
$$\sigma = \sqrt{\frac{P(\text{up}) \cdot P(\text{down})}{n}} = \sqrt{\frac{0.5 \cdot 0.5}{n}}$$
For n = 100:
$$\sigma = \sqrt{\frac{0.25}{100}} = 0.05$$

From the standard error, we can calculate confidence intervals:
95% Confidence Interval = observed proportion ± 2σ

Empirical Study: S&P 500 Analysis

Over 1,257 trading days (2016-2020), the S&P 500 closed up on 55.69% of days and down on 44.31%.

Calculating Standard Error

For the proportion of up days:
P(up) = 0.5569,  P(down) = 0.4431
The standard error is calculated as:
$$\sigma = \sqrt{\frac{0.5569 \cdot 0.4431}{1257}} \approx 0.0140$$

Confidence Interval

At 95% confidence,
Confidence Interval = 0.5569 ± 2(0.0140) = (0.5289, 0.5849)
Since 50% falls outside this range, we can reject the hypothesis of equal up and down days.
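
The same arithmetic in a few lines of Python, using the session’s figures (n = 1,257, 55.69% up days):

```python
import math

# Two-sided test of P(up) = 0.5 using the figures from the session:
# 1,257 trading days, 55.69% up days.
n, p_hat, p0 = 1257, 0.5569, 0.5

se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 2 * se, p_hat + 2 * se
print(f"SE = {se:.4f}")                        # ~0.0140
print(f"95% CI = ({low:.4f}, {high:.4f})")     # excludes 0.50
print("reject 50/50" if not (low <= p0 <= high) else "cannot reject")
```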

Conditional Probabilities

Given Previous Day’s Performance

If the market was up the previous day, we can ask how likely the market is to rise again the next day.

Statistical Significance

The probability following an up day is not significantly different from 50%. However, there is a significantly higher chance of up days after down days (59.6% vs. 50%).

Cumulative Events in Trading Days

Probability of Consecutive Events

To find the probability of several consecutive up days, multiply the single-day probabilities:
P(three up days) = (0.5569)³ ≈ 0.173
For five up days:
P(five up days) = (0.5569)⁵ ≈ 0.0536

Performance Persistence and Fund Management

Transition Probability

Funds can be classified into quartiles based on past performance, and the likelihood of performance persistence can be analyzed.

Default Probabilities

Corporate Bonds Default Rates

The probability of corporate bonds defaulting changes with economic conditions.

Altman Z-Score

The Altman Z-score combines five accounting ratios to classify companies by their likelihood of bankruptcy:
$$Z = 1.2\,\frac{WC}{TA} + 1.4\,\frac{RE}{TA} + 3.3\,\frac{EBIT}{TA} + 0.6\,\frac{MV}{TL} + 1.0\,\frac{\text{Sales}}{TA}$$
where WC is working capital, RE is retained earnings, MV is the market value of equity, TA is total assets, and TL is total liabilities.
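
A direct implementation of the score (the firm’s figures below are hypothetical, and the 1.81/2.99 cut-offs reflect the classic usage of the score rather than anything from the session):

```python
# Altman Z-score from the five ratios above; inputs are hypothetical.
def altman_z(wc, re, ebit, mv_equity, sales, ta, tl):
    return (1.2 * wc / ta + 1.4 * re / ta + 3.3 * ebit / ta
            + 0.6 * mv_equity / tl + 1.0 * sales / ta)

# Hypothetical firm (all figures in $ millions).
z = altman_z(wc=120, re=300, ebit=90, mv_equity=800, sales=1_000,
             ta=1_200, tl=500)
print(f"Z = {z:.2f}  ->", "safe" if z > 2.99 else
      "grey zone" if z > 1.81 else "distress")
```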

Decision Trees

Sequential Decision Analysis

Decision trees are useful for modeling complex decisions under uncertainty, such as in pharmaceutical approvals.

Expected Value Calculation

Expected value is the probability-weighted sum of the cash flows across outcomes:
Expected Value = ∑ P(outcome) × Cash Flow(outcome)
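
A tiny rollback example (probabilities and cash flows are made up): value the end node, then probability-weight back to the root:

```python
# Rolling back a two-stage decision tree: start from the end node and
# probability-weight back to the root. All numbers are made up.
p_stage1 = 0.70          # probability the first stage succeeds
p_stage2 = 0.50          # probability the second stage then succeeds
payoff = 500.0           # cash flow ($M) if both stages succeed
upfront_cost = 100.0     # cost ($M) incurred at the root

expected_value = p_stage1 * p_stage2 * payoff - upfront_cost
print(f"expected value = ${expected_value:.0f}M")  # 0.35 * 500 - 100 = $75M
```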

Scenario Analysis

Valuation Under Different Scenarios

Scenario analysis aids in evaluating the impact of different potential future events on value, focusing on extreme and likely outcomes.

Monte Carlo Simulations in Finance

Introduction

In this session, we will explore the integration of data distributions, probabilities, and Monte Carlo simulations as applied in data analysis and decision-making, particularly in finance.

Point Estimates vs. Distributions

Typically, in finance, we estimate inputs (independent variables) that feed into an output variable we are trying to explain or value.

Point Estimates

Transition to Distributions

Monte Carlo Simulations

Monte Carlo simulations allow us to replace point estimates of the key inputs with distributions, draw repeatedly from those distributions, and build up a full distribution of possible outcomes.

Building a Solid Model

For successful simulations:

  1. Build a sound base model connecting independent variables to the output variable.

  2. Keep the model simple and transparent.

  3. Focus on the most significant independent variables in terms of their effect on outputs.

Understanding Uncertainties

Analyzing uncertainties involves distinguishing between:

Discrete vs. Continuous Uncertainty

Symmetric vs. Asymmetric Risk

Choice of Distributions

It is crucial to identify the right type of distributions (e.g., normal, log-normal, triangular) to model your uncertainties.

Application: Valuing Apple Inc.

In a valuation example performed in May 2016, point estimates of the key inputs produced a base case value of $126 per share.

Selecting Distributions

Running the Simulation

Using Crystal Ball (an Excel add-in), the point estimates are replaced with the chosen distributions and the valuation is recomputed across thousands of draws.
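
Crystal Ball is a commercial add-in, but the same idea can be sketched in plain Python. Everything below (the base model, the distributions, and every parameter) is invented for illustration and is not the session’s actual Apple inputs:

```python
import random
import statistics

random.seed(11)

# A stripped-down Monte Carlo valuation: draw the key inputs from
# distributions instead of fixing them, re-value, and collect the
# distribution of per-share values. Model and parameters are invented.
def value_per_share(revenue_growth, margin, cost_of_capital):
    base_revenue, shares = 230_000, 5_500      # $M revenue, shares (M)
    cash_flow = base_revenue * (1 + revenue_growth) * margin
    return cash_flow / (cost_of_capital - 0.02) / shares  # 2% perpetual growth

values = sorted(
    value_per_share(
        random.normalvariate(0.03, 0.02),      # revenue growth: normal
        random.triangular(0.18, 0.28, 0.22),   # margin: triangular
        random.uniform(0.07, 0.10),            # cost of capital: uniform
    )
    for _ in range(10_000)
)
print(f"median value per share: {statistics.median(values):.0f}")
print(f"5th to 95th percentile: {values[500]:.0f} to {values[9500]:.0f}")
```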

Results

Median Value: Approximately $123 per share.
Value Range: The distribution of simulated values indicates a high probability (93%) that Apple is undervalued compared to its market trading price of $93.

Conclusions

Monte Carlo simulations provide not just a point estimate but a fuller picture of uncertainty, allowing analysts to make more informed decisions. By looking at the distribution of potential outcomes, analysts can more effectively address and understand the risks involved.
