

Probability

Counting Theory

Why Counting?

Experiments and Outcomes

Step Rule of Counting

If an experiment consists of multiple steps, the total number of outcomes is the product of outcomes at each step.
$$N = n_1 \times n_2 \times \cdots \times n_k$$
where $n_i$ is the number of outcomes in step $i$.

Example: Rolling Two Dice
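The worked example is not written out in these notes; a minimal Python sketch (not from the lecture) that enumerates the two-step experiment and confirms 6 × 6 = 36 outcomes:

```python
# Two-step experiment: roll one die, then another.
# By the step rule, total outcomes = 6 * 6 = 36.
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
assert len(outcomes) == 6 * 6
print(len(outcomes))  # 36
```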

Summation Rule of Counting

If outcomes can come from two or more events (sets), the total number of outcomes equals the sum of outcomes from each event minus any overlaps.
N = |A| + |B| − |A ∩ B|
where |A| = size of set A, |B| = size of set B.
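A small Python sketch (not from the lecture) of inclusion-exclusion on two hypothetical sets:

```python
# Inclusion-exclusion on two hypothetical outcome sets A and B.
A = {1, 2, 3, 4}        # |A| = 4
B = {3, 4, 5}           # |B| = 3, overlap {3, 4}
n = len(A) + len(B) - len(A & B)
assert n == len(A | B)  # counting with overlap removed matches the union
print(n)                # 5
```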

Problems for Practice

Unique Arrangements

Solution Approach

General formula for unique arrangements of a multiset:
$$N = \frac{n!}{n_1! \times n_2! \times \ldots \times n_k!}$$
where $n$ is the total number of items and $n_i$ is the count of each unique item.

Example Calculation for "boba"


$$\text{Total Letters} = 4 \quad (b, o, b, a) \\ \text{Arrangements} = \frac{4!}{2!} = \frac{24}{2} = 12$$
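A quick Python check (not from the lecture) that brute-force enumeration agrees with the multiset formula:

```python
from itertools import permutations

# Count distinct orderings of the letters in "boba".
# The multiset formula gives 4! / 2! = 12 (two identical b's).
distinct = set(permutations("boba"))
print(len(distinct))  # 12
```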

Conclusion

Counting Principles in Probability and Combinatorics

Introduction

In this lecture, we focus on the essential rules of counting, which serve as the foundation for probability theory and combinatorial mathematics. Key concepts include permutations, combinations, and methods for counting arrangements of indistinguishable objects.

Key Rules of Counting

Fundamental Counting Principles

The two core rules of counting are:

  1. Multiplication Rule: If an experiment can be broken down into a series of steps where the number of outcomes at each step does not depend on previous outcomes, the total number of outcomes is the product of the number of outcomes at each step.


    $$n = n_1 \times n_2 \times \cdots \times n_k$$

    where $n$ is the total number of outcomes, and $n_i$ is the number of outcomes for step $i$.

  2. Addition Rule: If an experiment can result in outcomes from two or more distinct groups, the total number of outcomes is the sum of the outcomes from each group, minus any overlaps.


    $$n = n_A + n_B - n_{A \cap B}$$

Permutations

A permutation refers to an ordered arrangement of objects.

Distinct Objects

The number of permutations of n distinct objects is given by:


P(n) = n!

For example, for the letters in "CHRIS":
P(5) = 5! = 120

Indistinct Objects

If some objects are indistinct, the formula for permutations changes to account for the indistinguishable items.

The general formula for n objects where there are groups of indistinguishable objects is:


$$P(n; n_1, n_2, \ldots, n_k) = \frac{n!}{n_1! \cdot n_2! \cdot \ldots \cdot n_k!}$$

For example, for the word "MISSISSIPPI":
$$P = \frac{11!}{4! \cdot 4! \cdot 2!}$$

where there are four ’I’s, four ’S’s, and two ’P’s.

Combinations

A combination refers to the selection of objects without regard to the order.

Distinct Objects

If we want to choose k objects from n distinct objects, we utilize the combinations formula:


$$C(n, k) = \frac{n!}{k!(n-k)!}$$

For example, to choose 3 books from 6:
$$C(6, 3) = \frac{6!}{3! \cdot 3!} = 20$$

Indistinct Objects

When the objects themselves are indistinguishable, the same idea applies, but each distinct grouping is counted exactly once. The most common case, distributing indistinct objects among distinct containers, is handled by the bucket formula below.

Putting Objects into Buckets

If you have n indistinct objects to put into r distinct buckets, the formula is given by:


$$\text{Ways} = \frac{(n + r - 1)!}{n! \cdot (r - 1)!}$$

This is derived from the method of "stars and bars" or dividers.

Example

To place 4 indistinguishable objects into 3 distinct buckets:
$$\text{Ways} = \frac{(4 + 3 - 1)!}{4! \cdot (3 - 1)!} = \frac{6!}{4! \cdot 2!} = 15$$
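A small Python sketch (not from the lecture) of the stars-and-bars formula, using the standard library's `math.comb`:

```python
from math import comb

def bucket_ways(n, r):
    """Ways to place n indistinct objects into r distinct buckets
    (stars and bars): C(n + r - 1, n)."""
    return comb(n + r - 1, n)

print(bucket_ways(4, 3))  # 15
```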

Conclusion

Understanding these foundational counting principles is crucial for tackling more complex problems in probability and combinatorics. Future classes will build on these concepts, particularly as we apply them to probability theory.

Introduction to Probability for Computer Scientists

Course Overview

This course marks the beginning of our journey into probability theory, specifically tailored for computer scientists. Get ready to delve deeply into the world of probabilities!

Counting Techniques Recap

In prior classes, we covered advanced counting techniques:

  1. Permutations: The number of ways to arrange n distinct items:
    P(n) = n!

  2. Combinations: The number of ways to choose k items from n:
    $$C(n, k) = \frac{n!}{k!(n-k)!}$$

  3. Distributions (Bucketing): Number of ways to distribute n distinct items into k buckets.

Basic Concepts of Probability

Sample Space (S)

The sample space is the set of all possible outcomes of an experiment.

Event Space (E)

An event is a subset of the sample space that satisfies a particular condition.

Definition of Probability

The probability of an event E occurring is denoted P(E) and is defined as:
$$P(E) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} \quad \text{(if equally likely)}$$
Where P(E) must satisfy:
0 ≤ P(E) ≤ 1

Axioms of Probability (Kolmogorov)

  1. Non-negativity: P(E) ≥ 0

  2. Normalization: P(S) = 1

  3. Additivity: For mutually exclusive events A and B,
    P(A ∪ B) = P(A) + P(B)

Equally Likely Outcomes

When all outcomes in the sample space are equally likely, the probability of an event E is given by:
$$P(E) = \frac{|E|}{|S|}$$
where |E| is the number of outcomes in event E and |S| is the number of outcomes in the sample space.

Example: Rolling Two Dice


$$P(\text{sum} = 7) = \frac{6}{36} = \frac{1}{6}$$
Here, we count the pairs: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1).
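A brute-force Python check (not from the lecture) of this count:

```python
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two dice
# and count those summing to 7.
rolls = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
favorable = [r for r in rolls if sum(r) == 7]
print(len(favorable), Fraction(len(favorable), len(rolls)))  # 6 1/6
```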

Complex Counting Example: Cows and Pigs

Consider 4 cows and 3 pigs, we want to find the probability of drawing 1 cow and 2 pigs.

  1. Sample Space: C(7, 3)

  2. Event Space: Choose 1 cow from 4, and choose 2 pigs from 3:
    $$P(\text{1 cow, 2 pigs}) = \frac{C(4, 1) \times C(3, 2)}{C(7, 3)} = \frac{4 \times 3}{35} = \frac{12}{35}$$

Card Probabilities: Straights in Poker

To calculate the probability of drawing a straight hand from a deck:

  1. Sample Space: C(52, 5)

  2. Event Space Count: Count the valid straights by choosing a starting card value (10 options, e.g., (2, 3, 4, 5, 6)) and then a suit for each of the 5 cards ($4^5$ ways):
    $$P(\text{Straight}) = \frac{10 \cdot 4^5}{C(52, 5)}$$
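A quick Python evaluation of this expression (not from the lecture); note that the $10 \cdot 4^5$ count includes straight flushes, matching the formula above:

```python
from math import comb

# 10 starting values, 4 suit choices per card.
straights = 10 * 4**5
hands = comb(52, 5)
print(straights / hands)  # ~0.00394
```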

Probability of a Defective Chip

With N chips, one defective, and K chosen:
$$P(\text{defective in sample}) = \frac{K}{N}$$

Final Remark

Probability theory is a profound language that helps us quantify uncertainty and understand the randomness inherent in our world.

Introduction to Probability

Review of Last Week’s Topics

Last week focused on combinatorics and the foundations of probability.

For example, arranging 52 unique cards yields 52! (52 factorial) orderings, a number so large that any well-shuffled deck is almost certainly in an order that has never occurred before.

Probability Axioms

We discussed the three fundamental axioms of probability:

  1. P(E) ≥ 0 for any event E

  2. P(S) = 1 where S is the sample space

  3. For any two mutually exclusive events E and F:
    P(E ∪ F) = P(E) + P(F)

What this means is that if no outcomes overlap between E and F, the probability of E or F happening can be calculated by summing their probabilities.

Conditional Probability

A key concept in probability is conditional probability, denoted as:
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
where P(E ∣ F) is the probability of event E given F has occurred.

Mutually Exclusive Events

If two events E and F are mutually exclusive, then:
P(E ∪ F) = P(E) + P(F)
We can apply these concepts to practical situations, illustrating through examples such as card drawing.

The Law of Total Probability

The law of total probability states:
$$P(E) = P(E \mid F)P(F) + P(E \mid F^c)P(F^c)$$
This allows for assessing P(E) based on conditional probabilities related to the disjoint partitioning events $F$ and $F^c$.

Bayes’ Theorem

Once we establish a conditional probability, we can explore probabilities in reverse using Bayes’ Theorem:
$$P(F \mid E) = \frac{P(E \mid F) P(F)}{P(E)}$$
This allows us to compute the probability of an event given new evidence.

Example: Testing for Disease

Consider a medical test:

  - The probability of having the disease: P(D)

  - The probability of testing positive given you have the disease: P(T ∣ D)

  - The probability of testing positive given you do not have the disease: $P(T \mid D^c)$

An example using these probabilities illustrates practical usage:
$$P(D \mid T) = \frac{P(T \mid D) P(D)}{P(T)}$$
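A small Python sketch of this computation with hypothetical numbers (the prevalence, sensitivity, and false-positive rate below are illustrative, not from the notes):

```python
# Hypothetical numbers: 1% prevalence, 95% sensitivity,
# 5% false-positive rate.
p_d = 0.01              # P(D)
p_t_given_d = 0.95      # P(T | D)
p_t_given_not_d = 0.05  # P(T | D^c)

# Law of total probability gives the denominator P(T).
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))  # ~0.161
```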

Updated Beliefs and Evidence

When we observe an event:

  - Our a priori beliefs (prior probabilities) can be updated based on the evidence.

  - This updating process makes our decision-making more robust.

Conclusion

In summary, you’ve acquired tools for dealing with uncertainty through the establishment of probabilities, including conditional probability, the law of total probability, and Bayes’ Theorem.

Spend some time working through these principles, as they are essential for understanding both theoretical and applied statistics and probability. Remember to practice using Bayes’ Theorem as it is crucial in many real-world applications.

Independence and Probability

Introduction

In this lecture, we delve into the concept of independence, which is critical for understanding probability theory and building algorithms that make decisions under uncertainty.

Review of Probability

Probabilities of Events

Conditional Probability

The definition of conditional probability states:
$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$
Rearranging this gives us the Chain Rule:
P(A ∩ B) = P(A) ⋅ P(B|A)
This formula is fundamental in probability.

Law of Total Probability and Bayes’ Theorem

The Law of Total Probability states:
$$P(A) = \sum_{i} P(A \mid B_i) P(B_i)$$
where the $B_i$ partition the sample space.

Bayes’ theorem relates conditional probabilities:
$$P(B | A) = \frac{P(A | B) P(B)}{P(A)}$$

Mutual Exclusivity vs Independence

Mutual Exclusivity

Two events A and B are mutually exclusive if they cannot occur at the same time:
P(A ∩ B) = 0
If they are mutually exclusive, the probability of either event occurring is:
P(A ∪ B) = P(A) + P(B)

Independence

Two events A and B are independent if the occurrence of one does not affect the other:
P(A|B) = P(A)
This also means:
P(A ∩ B) = P(A) ⋅ P(B)

Conditions for Multiple Events

For n independent events $E_1, E_2, \ldots, E_n$, the probability that all occur is:
$$P(E_1 \cap E_2 \cap \cdots \cap E_n) = P(E_1) \cdot P(E_2) \cdots P(E_n)$$

Calculating Probabilities

Probability of K Heads in Coin Flips

When flipping n coins with probability p of heads, the probability of exactly k heads is:
$$P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k}$$

Example Problems

Example 1: Coin Flips

If we flip 10 coins and each coin has a probability of 0.6 for heads, what is the probability of getting exactly 6 heads?
$$P(6 \text{ heads}) = \binom{10}{6} (0.6)^6 (0.4)^4$$
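A one-line Python check of this value (not from the lecture):

```python
from math import comb

# P(exactly 6 heads in 10 flips with p = 0.6).
p = comb(10, 6) * 0.6**6 * 0.4**4
print(round(p, 4))  # ~0.2508
```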

Example 2: Router Functionality

If independent routers have various probabilities of functioning, the probability that at least one router is functioning can be calculated as:
$$1 - P(\text{no router functioning}) = 1 - \prod_{i=1}^{n} (1 - p_i)$$

Conclusion

Understanding independence and mutual exclusivity are crucial skills for computing probabilities efficiently. The definitions and calculations introduced in this lecture lay the groundwork for solving more complex probability scenarios.

Probability and Random Variables

Introduction

The lecture began with a warm welcome and an overview of logistics concerning problem sets and upcoming workshops. An introduction to key concepts in probability was provided, setting the stage for discussions on Bayes’ Theorem, random variables, and expectations.

Key Concepts in Probability

Bayes’ Theorem

Bayes’ Theorem is a fundamental concept in probability that describes how to update the probability estimate for a hypothesis as more evidence is acquired. It is expressed as:
$$P(H | E) = \frac{P(E | H) P(H)}{P(E)}$$
where P(H ∣ E) is the posterior probability of the hypothesis, P(E ∣ H) is the likelihood of the evidence, P(H) is the prior, and P(E) is the probability of the evidence.

Law of Total Probability

The denominator of Bayes’ Theorem can be expressed using the Law of Total Probability:
$$P(E) = \sum_{i} P(E \mid H_i) P(H_i)$$
This allows for the calculation of probabilities over multiple possible outcomes.

Problem Set Insights

In the context of the problem sets, students were tasked with programming a solution involving Bayes’ Theorem, exploring more generalized scenarios with multiple outcomes rather than binary situations.

Random Variables

Definition

A random variable X is a variable whose possible values are numerical outcomes of a random phenomenon. Random variables can be discrete or continuous.

Probability Mass Function (PMF)

A discrete random variable has an associated probability mass function P(X = x) that provides the probability that the random variable takes on the value x. The PMF must satisfy:
$$\sum_{x} P(X = x) = 1$$

Expectation

The expectation (or mean) of a random variable X is a measure of the central tendency of the random variable, defined as:
$$E[X] = \sum_{x} x \, P(X = x)$$
This value may not correspond to a possible outcome of the random variable but gives a central measure.

Example of Expectation

For a fair six-sided die, the expectation is calculated as follows:
$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$

Conditional Independence

Two events E and F are conditionally independent given a third event G if:
P(E ∩ F|G) = P(E|G)P(F|G)
This means that once G is known, learning whether E occurred provides no additional information about F.

Clarifications and Common Confusions

Events vs. Random Variables

It is important to note the distinction between events and random variables:

Conclusion and Expectations from Random Variables

The lecture concluded with a challenge involving a game based on coin flips and expected winnings, leading to discussions around the implications of expectations when probabilistic outcomes diverge significantly.

Further Discussion

Students were encouraged to ask questions and discuss insights gained from the problem sets and assignments.

Random Variables, Binomial Distribution, and Variance

Learning Goals

By the end of this lecture, you should be able to:

Introduction to Random Variables

A random variable is a numerical outcome of a probabilistic process. It can take various outcomes, each with an associated probability. The relationship between these outcomes and their probabilities is captured by the probability mass function (PMF).


P(X = x) = PMF(x)

where X is a random variable, x is a particular outcome.

Expectation

The expectation (mean) of a random variable X is defined as:
$$E[X] = \sum_{x} x \cdot P(X = x)$$

Key properties of expectation include linearity: E[aX + b] = aE[X] + b and E[X + Y] = E[X] + E[Y].

Binomial Distribution

A Binomial random variable models the number of successes in n independent Bernoulli trials, each with a success probability p.

The PMF of a binomially distributed random variable X is given by:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
where $\binom{n}{k}$ is the binomial coefficient.

Examples of the Binomial Distribution

Special Case: Bernoulli Random Variable

A Bernoulli random variable is a special case of the binomial distribution where n = 1.

It takes on two values:
$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$$

Expectation of a Bernoulli Variable


E[X] = p

Variance

The variance of a random variable gives a measure of how spread out the values are. The formula for variance Var(X) is defined as:
$$\text{Var}(X) = E[(X - \mu)^2]$$
where μ = E[X].

Using the law of the unconscious statistician to compute $E[X^2]$, the variance can be rewritten as:
$$\text{Var}(X) = E[X^2] - (E[X])^2$$
where
$$E[X^2] = \sum_{x} x^2 \cdot P(X = x)$$
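A small Python sketch (not from the lecture) applying both variance formulas to a fair die:

```python
from fractions import Fraction

# Variance of a fair die via E[X^2] - (E[X])^2.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())       # 7/2
second = sum(x**2 * p for x, p in pmf.items())  # 91/6
print(second - mean**2)                         # 35/12
```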

Conclusion

In today’s lecture, we covered the fundamentals of random variables, focusing on binomial and Bernoulli distributions, and calculated their expectations and variances. These concepts form the bedrock of probability theory, allowing for better understanding of data in future sessions.

Introduction to Poisson Distribution

Introduction

In today’s class, we will explore the Poisson distribution, a fundamental concept in probability theory that is widely applicable in fields such as epidemiology, finance, and environmental science.

Random Variables

Definition

A random variable is a variable that takes on values probabilistically, inheriting all the associated mathematical properties established by previous mathematicians.

Types of Random Variables

We primarily distinguish between discrete and continuous random variables.

We have previously discussed discrete random variables, particularly the Binomial distribution, which models the number of successes in independent trials.

Binomial Distribution

Definition

Let n be the number of trials, and p the probability of success on each trial. The Binomial distribution models the number of successes, denoted as X ∼ Binomial(n, p).

Probability Mass Function

The probability of exactly k successes in n trials is given by:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
where the binomial coefficient is defined as:
$$\binom{n}{k} = \frac{n!}{k!(n - k)!}$$

Expectation and Variance

For a Binomial random variable, E[X] = np and Var(X) = np(1 − p).

Introduction to Poisson Distribution

The Poisson distribution models the number of occurrences of an event over a fixed interval of time or space given a known average rate λ.

Definition

A random variable Y is Poisson distributed if:
Y ∼ Poisson(λ)
where λ is the average rate of occurrences.

Probability Mass Function

The Poisson probability mass function is given by:
$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
for k = 0, 1, 2, ….

Connection to Binomial Distribution

When the number of trials n is large, and the probability of success p is small such that λ = n ⋅ p remains moderate, the Poisson distribution can be used as an approximation for the Binomial distribution.

Expectation and Variance

For a Poisson random variable, E[Y] = λ and Var(Y) = λ.

Applications of Poisson Distribution

The Poisson distribution applies to various real-world scenarios, often where events occur independently over time, such as:

Example: Analyzing Hurricane Data

In our class, we analyzed data from the HURDAT database, which records hurricanes in the Atlantic. The goal was to determine the average number of hurricanes per year.

Calculating the Rate

The average rate λ was calculated as follows:
$$\lambda = \frac{\text{Total Hurricanes}}{\text{Number of Years}}$$

Based on recorded data from 1851 to 1966:

  - Total hurricanes: 975

  - Number of years: 115

  - $\lambda = \frac{975}{115} \approx 8.5$

Poisson Distribution Analysis

With λ = 8.5, we can model the number of hurricanes in a given year using a Poisson distribution:
$$P(Y = k) = \frac{8.5^k e^{-8.5}}{k!}$$

We can determine the probability of having more than 15 hurricanes in a given year.


$$P(Y > 15) = 1 - P(Y \leq 15) = 1 - \sum_{k=0}^{15} P(Y = k)$$

From historical data, the probability of exceeding 15 hurricanes in a year was found to be 1.3%.
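A Python check of this tail probability (not from the lecture); the exact sum lands near the quoted 1.3%, with small differences attributable to rounding λ:

```python
from math import exp, factorial

# P(Y > 15) for Y ~ Poisson(8.5), by summing the PMF up to 15.
lam = 8.5
p_le_15 = sum(lam**k * exp(-lam) / factorial(k) for k in range(16))
print(round(1 - p_le_15, 4))  # roughly 0.013-0.014
```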

Distribution Shift Analysis

Post-1966, the data indicated years with 30 hurricanes, which were improbable under the initial Poisson model. The new analysis suggested a different Poisson distribution with potentially higher rates due to advancements in hurricane detection.

Conclusion

Today’s exploration of the Poisson distribution showcased its utility in modeling rare events and highlighted the importance of recognizing underlying assumptions. Further insights were gathered through the analysis of hurricane data, which indicated a potential shift in hurricane frequency and intensity.

Notes on Continuous Random Variables

Introduction

In this class, we will explore the concept of continuous random variables, building on prior knowledge of discrete random variables. Key topics will include various types of distributions, their properties, and the mathematical tools used to analyze them.

Jump to Random Variables

Random Variables

Discrete Random Variables Recap

1. Binomial Random Variable: Represents the number of successes in n independent Bernoulli trials.
Probability Mass Function (PMF):
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$
Expectation:
E[X] = np
Variance:
Var[X] = np(1 − p)

2. Poisson Random Variable: Models the number of events occurring in a fixed interval of time/space.
PMF:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
Expectation:
E[X] = λ
Variance:
Var[X] = λ

New Random Variables

1. Geometric Random Variable: Models the number of trials until the first success.
PMF:
$$P(X = k) = (1-p)^{k-1} p, \quad k = 1, 2, \ldots$$
Expectation:
$$E[X] = \frac{1}{p}$$
Variance:
$$Var[X] = \frac{1-p}{p^2}$$

2. Negative Binomial Random Variable: Represents the number of trials until r successes.
PMF:
$$P(X = n) = \binom{n-1}{r-1} p^r (1-p)^{n-r}, \quad n = r, r+1, r+2, \ldots$$
Expectation:
$$E[X] = \frac{r}{p}$$
Variance:
$$Var[X] = \frac{r(1-p)}{p^2}$$

Continuous Random Variables

1. Introduction

2. Probability Density Function (PDF): The PDF, f(x), gives the relative likelihood of the random variable taking on a given value.
Areas under the PDF represent probabilities:


$$P(a < X < b) = \int_a^b f(x) \, dx$$

$$\int_{-\infty}^{\infty} f(x) \, dx = 1$$

3. Uniform Distribution

All outcomes are equally likely between a and b.
PDF:
$$f(x) = \begin{cases} \frac{1}{b-a}, & \text{if } a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}$$

4. Exponential Distribution

Models the time until an event occurs (e.g., waiting time until an earthquake).
PDF:
$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$
Expectation:
$$E[X] = \frac{1}{\lambda}$$
Variance:
$$Var[X] = \frac{1}{\lambda^2}$$

Application of Continuous Random Variables

1. Using the PDF to Calculate Probabilities - The probability that X exceeds a value can be derived using:
$$P(X > x) = 1 - P(X \leq x) = 1 - \int_{-\infty}^{x} f(t) \, dt$$

2. Cumulative Distribution Function (CDF) - CDF, F(x), is the probability that the random variable is less than or equal to a certain value:
$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$

3. Conversion from PDF to CDF - To compute a desired probability, leverage the CDF when possible:
P(a < X < b) = F(b) − F(a)

Conclusion

Understanding continuous random variables and their corresponding properties is essential in applied statistics and fields such as machine learning. This framework allows for the analysis of real-world phenomena where predictions and assessments rely on an understanding of distributions, expectations, and variances.

Notes on the Normal Distribution

Random Variables

Recall that random variables can be classified into:

Probability and Continuous Variables

For continuous random variables, the probability of any single exact value is zero; probabilities are instead assigned to intervals via a density function.

The Normal Distribution

The Normal Distribution, also known as the Gaussian distribution, is defined by its probability density function:
$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where μ is the mean and $\sigma^2$ is the variance.

Properties

Cumulative Distribution Function (CDF)

The CDF for the normal distribution can be expressed as:
$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$
Since it has no closed form, it is often approximated or referenced through standardized values.

Standard Normal Distribution

A special case of the normal distribution is the standard normal distribution, which has $\mu = 0$ and $\sigma^2 = 1$. The PDF simplifies to:
$$f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}}$$
To convert any normal variable to the standard normal form, we apply:
$$Z = \frac{X - \mu}{\sigma}$$

Applications

Using the Normal Distribution

Probability problems involving the normal distribution generally follow these steps:

  1. Define the random variable.

  2. Determine its mean and standard deviation.

  3. Utilize the Z-score to convert to standard normal.

  4. Reference the CDF, either through tables or computational tools.

Continuity Correction

For discrete distributions approximated by continuous distributions (e.g., Binomial approximated by Normal), approximate P(X = k) by P(k − 0.5 < Y < k + 0.5), widening each discrete value into a unit-length interval.

Computer Simulation and Sampling

Many problems can be solved through computer simulation using the Monte Carlo method or direct sampling from the distributions using software tools that provide standard normal computation.
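A minimal Monte Carlo sketch in Python (not from the lecture), estimating a standard normal tail probability by direct sampling:

```python
import random

# Monte Carlo estimate of P(Z > 1.5) for a standard normal Z,
# by sampling and counting the fraction of hits.
random.seed(0)
n = 100_000
hits = sum(random.gauss(0.0, 1.0) > 1.5 for _ in range(n))
print(hits / n)  # ~0.067 (true value is about 0.0668)
```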

Conclusions

The Normal Distribution is a foundational concept in statistics and used in various fields, from natural phenomena to data science. Its mastery is crucial for understanding more advanced topics in probability and statistics.

Probability Notes

Introduction

In this session, we explored exciting concepts in probability, focusing on the transition from basic probability theory to solving real-world problems using probabilistic models. We also discussed a historical mystery related to the authorship of the Federalist Papers.

Normal Distribution

The normal distribution is a fundamental concept in probability that has applications in various real-world situations.

Probability Density Function (PDF)

The normal distribution is defined by its probability density function, given by:


$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where μ is the mean and $\sigma^2$ is the variance.

Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) is defined as:


$$F(x) = P(X \leq x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

where Φ represents the CDF of the standard normal distribution.

Finding Probabilities

To find the probability that a normally distributed random variable X is less than some value x, we use the formula:


$$P(X < x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

where μ is the mean and σ is the standard deviation.

Joint Probability Mass Function

When dealing with two random variables X and Y, the joint probability mass function (PMF) is represented as P(X = x, Y = y). It provides complete information about the probabilities of different outcomes.

Marginalization

The marginal probability of X can be obtained by summing over all possible values of Y:


$$P(X = x) = \sum_{y} P(X = x, Y = y)$$

This technique allows us to extract information about one variable while ignoring the other.

Joint Probability Table

For discrete random variables, a joint probability table can be created to summarize the probabilities of all combinations of values taken by X and Y.

Multinomial Distribution

The multinomial distribution generalizes the binomial distribution for cases with more than two outcomes. It describes the probabilities of counts of k different outcomes observed in n trials.

Probability Mass Function

The multinomial PMF is given by:


$$P(X_1 = k_1, X_2 = k_2, \ldots, X_k = k_k) = \frac{n!}{k_1! k_2! \cdots k_k!} p_1^{k_1} p_2^{k_2} \cdots p_k^{k_k}$$

where pi is the probability of outcome i occurring, and ki is the count of those outcomes.

Application: Authorship of the Federalist Papers

In investigating who authored the Federalist Papers, specifically Paper 53, we can model the problem using the multinomial distribution.

Data Collection

We collect the text from the Federalist Papers and text segments known to be written by Hamilton and Madison. The goal is to use word counts to derive probabilities.

Bayes’ Theorem in Authorship Attribution

We want to find the probability of an author given the document:


$$P(H | D) = \frac{P(D | H)P(H)}{P(D)}$$

This leads to the ratio:


$$\frac{P(D | H)P(H)}{P(D | M)P(M)}$$

The multinomial model allows us to compute P(D|H) and P(D|M) efficiently based on word counts observed in the text.

Conclusion

In summary, the session covered fundamental probability concepts including the normal distribution, marginalization in joint distributions, and the multinomial distribution. We also discussed an intriguing application involving the Federalist Papers, demonstrating the power of probabilistic models in solving historical questions.

Inference and Belief Updates

Lecture Overview

Key Concepts

Random Variables and Inference

Inference is formally about how we adjust our beliefs regarding random variables upon observing new evidence. The equations involved include:
$$\begin{aligned} P(A | B) &= \frac{P(A, B)}{P(B)}\end{aligned}$$
where P(A|B) is the conditional probability of event A given event B.

Joint Distributions

Understanding multiple random variables interacting is vital. For instance, if we have two random variables X and Y, their joint distribution is represented as:
$$\begin{aligned} P(X, Y) = P(X | Y) \cdot P(Y)\end{aligned}$$

Bayes’ Theorem

Bayes’ theorem is fundamental in updating our beliefs. It can be stated as:
$$\begin{aligned} P(H | D) = \frac{P(D | H) \cdot P(H)}{P(D)}\end{aligned}$$
where P(H ∣ D) is the posterior, P(D ∣ H) is the likelihood, P(H) is the prior, and P(D) is the probability of the data.

Probabilities and Density Functions

A critical insight is understanding that for continuous random variables, the probability at a precise value is zero. Instead, we consider ranges or density functions. For a continuous random variable:
$$\begin{aligned} P(X = x) = 0 \quad \text{for any specific } x\end{aligned}$$
However, we can define:
$$\begin{aligned} P(a < X < b) = \int_a^b f_X(x) \, dx\end{aligned}$$
where fX(x) is the probability density function.

Epsilon Trick

When working with the probability density function:
$$\begin{aligned} P(X = x) \approx f_X(x) \cdot \epsilon\end{aligned}$$
where ϵ represents a very small number.

Application: Elephants

We utilized a hypothetical situation involving elephant weights to illustrate Bayesian inference with continuous and discrete probability:
$$\begin{aligned} P(G = 1 | X = 163) = \frac{P(X = 163 | G = 1) \cdot P(G = 1)}{P(X = 163)}\end{aligned}$$
where G = 1 signifies a female elephant.

Normal Distributions

For the weights:
$$\begin{aligned} X | G=1 \sim \mathcal{N}(\mu_1, \sigma_1^2) \quad \text{and} \quad X | G=0 \sim \mathcal{N}(\mu_0, \sigma_0^2)\end{aligned}$$
where you would compute P(X = 163|G = 1) using the normal PDF:
$$\begin{aligned} f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\end{aligned}$$
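A Python sketch of this computation (not from the lecture); the means, standard deviations, and prior below are hypothetical placeholders:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Hypothetical parameters: female and male elephant weight
# distributions and a 50/50 prior on gender.
mu_f, sigma_f = 150.0, 20.0   # X | G=1
mu_m, sigma_m = 180.0, 25.0   # X | G=0
prior_f = 0.5

x = 163.0
num = normal_pdf(x, mu_f, sigma_f) * prior_f
den = num + normal_pdf(x, mu_m, sigma_m) * (1 - prior_f)
# Posterior P(G=1 | X=163); the epsilon factors cancel in the ratio.
print(num / den)
```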

Conclusion and Takeaways

Understanding inference involves grasping how to update our beliefs using joint distributions and applying Bayes’ theorem effectively. Continuous variables require special consideration, especially in terms of using density functions instead of probabilities.

We will explore practical applications of this mathematics in real-world scenarios, specifically regarding the Stanford Acuity Test, an eye test that employs inference principles to update beliefs about visual acuity.

Bayesian Inference and Item Response Theory

Introduction

In this lecture, we explore Bayesian inference in the context of the Stanford Eye Test, where we estimate an individual’s visual acuity based on their responses to letters of varying sizes.

Bayesian Inference

Bayesian inference allows us to update our beliefs about a random variable based on new evidence. The fundamental tool we use is Bayes’ Theorem:


$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$

Where P(A ∣ B) is the posterior, P(B ∣ A) is the likelihood, P(A) is the prior, and P(B) is the probability of the evidence.

This theorem enables us to update our probability beliefs as we gain more evidence.

Application: Stanford Eye Test

The Stanford Eye Test involves presenting letters of varying sizes to a subject and recording whether their responses are correct or not. Given a letter size s and a response y (correct/incorrect), we want to update our belief about the subject’s visual acuity A:


$$P(A | Y = y, S = s) = \frac{P(Y = y | A, S = s) P(A)}{P(Y = y | S = s)}$$

Prior Belief

The prior belief about the ability to see can be represented in various forms:

Likelihood

The likelihood term P(Y = y|A, S = s) is critical. It describes how likely an observation (getting a specific letter correct or incorrect) is given the subject’s ability:


P(Y = y|A = a, S = s) = f(a, s)

Where f(a, s) is a function determined through empirical data or theoretical models.

Normalization Constant

The normalization constant P(Y = y|S = s) assures that probabilities sum to one after updating:


$$P(Y = y \mid S = s) = \sum_{a} P(Y = y \mid A = a, S = s) \, P(A = a)$$

This is calculated across all possible acuity levels a to ensure the posterior probability is valid.

Item Response Theory

Item Response Theory (IRT) is a statistical framework used for modeling the probability of a correct response on a test item based on a person’s latent traits (ability) and the item’s characteristics (difficulty).


$$P(y = 1 | a, d) = \sigma(a - d) = \frac{1}{1 + e^{-(a - d)}}$$

Where a is the person’s ability, d is the item’s difficulty, and σ is the sigmoid (logistic) function.

The probability of an incorrect response is simply:


P(y = 0|a, d) = 1 − P(y = 1|a, d)

Incorporating Guessing and Slipping

It’s vital to account for potential guessing errors or slips (failing to respond correctly even when the answer is known):


$$P(y = 1 \mid a, d) = (1 - p_{\text{slip}}) \, \sigma(a - d) + p_{\text{guess}}$$

Where $p_{\text{slip}}$ is the probability of making a mistake despite knowing the answer, and $p_{\text{guess}}$ is the probability of guessing correctly.
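A small Python sketch of this response model (not from the lecture); the slip and guess values are hypothetical:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def p_correct(a, d, p_slip=0.05, p_guess=0.1):
    # Guess/slip model as written above; p_slip and p_guess
    # are hypothetical example values.
    return (1 - p_slip) * sigmoid(a - d) + p_guess

print(p_correct(a=1.0, d=0.5))  # ~0.69: ability slightly above difficulty
```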

Updating Beliefs with Multiple Observations

When new observations arise, we can iteratively update our beliefs. Each new observation’s corresponding posterior becomes the prior for the next iteration.

For two observations, we have:


$$\begin{aligned} P(A | Y_1 = y_1, S_1 = s_1) & \text{ (update belief with the first observation)} \\ P(A | Y_2 = y_2, S_2 = s_2) & \text{ (update belief with the second observation using the posterior from the first)}\end{aligned}$$

This process loops through each observation, gradually refining our estimates of acuity based on cumulative evidence.

Conclusion

By applying Bayes’ theorem and item response theory, we can create a robust framework for estimating human visual acuity based on probabilistic models. Understanding how to update beliefs with new evidence is fundamental not only in this context but across various domains.

Probabilistic Modeling and Bayesian Networks

Understanding Joint Distributions

The core idea of probabilistic models is the concept of joint distributions. A joint distribution allows us to describe the probability of multiple random variables occurring at once:


$$P(X_1, X_2, \ldots, X_n)$$

This represents the complete information about the probabilistic model involving random variables $X_1, X_2, \ldots, X_n$.

Inference

Once we have a joint distribution, we can perform inference, which involves updating beliefs about one or more variables given the observed values of others.

An example discussed was whether one could determine the gender of an elephant from its weight. We start with the prior P(Gender) and seek the posterior distribution:


P(Gender|Weight)

which we compute via Bayes’ theorem:


$$P(Gender | Weight) = \frac{P(Weight | Gender) \cdot P(Gender)}{P(Weight)}$$

In more complex scenarios, such as inferring a variable that can take multiple values, we may need to perform multiple iterations of Bayes’ theorem.

Bayesian Networks

A Bayesian Network is a graphical model that represents a set of variables and their conditional dependencies through a directed acyclic graph (DAG).

Structure of a Bayesian Network

Each node represents a random variable, while directed edges imply causality. For example, consider the following variables:

The structure might look like this:


$$\text{Flu} \rightarrow \text{Fever}; \qquad \text{Flu}, \ \text{Undergraduate} \rightarrow \text{Tiredness}$$

Probabilities in Bayesian Networks

To define a Bayesian Network, we need to specify probabilities for each variable conditioned on its parents. For example:


P(Fe|F)  and  P(T|F, U)

This leads to a product of probabilities that reconstructs the joint distribution:


P(F, T, Fe, U) = P(F) ⋅ P(U) ⋅ P(Fe|F) ⋅ P(T|F, U)
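A Python sketch (not from the lecture) that reconstructs this joint from hypothetical conditional probability tables:

```python
# Hypothetical CPTs for the network above: F (flu) and U (undergrad)
# are roots; Fe depends on F; T depends on F and U.
P_F = {True: 0.1, False: 0.9}
P_U = {True: 0.5, False: 0.5}
P_Fe_given_F = {True: 0.9, False: 0.05}          # P(Fe=1 | F)
P_T_given_FU = {(True, True): 0.95, (True, False): 0.8,
                (False, True): 0.6, (False, False): 0.2}  # P(T=1 | F, U)

def joint(f, t, fe, u):
    """P(F=f, T=t, Fe=fe, U=u) as the product of local factors."""
    p_fe = P_Fe_given_F[f] if fe else 1 - P_Fe_given_F[f]
    p_t = P_T_given_FU[(f, u)] if t else 1 - P_T_given_FU[(f, u)]
    return P_F[f] * P_U[u] * p_fe * p_t

print(joint(True, True, True, False))  # 0.1 * 0.5 * 0.9 * 0.8 = 0.036
```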

Calculating Covariances and Independence

In probabilistic modeling, understanding independence and the relation between random variables is paramount.

Independence Definition

Two random variables X and Y are independent if, for all values x and y:


P(X = x, Y = y) = P(X = x) ⋅ P(Y = y)

This dictates that knowledge about X provides no information about Y, and vice versa.

Covariance

Covariance provides a measure of how much two random variables co-vary. It is defined as:


$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Alternatively, it can be computed as:


Cov(X, Y) = E[XY] − E[X]E[Y]

Finding Independencies from Data

To discover the independence structure within data, one can calculate the covariance matrix. For instance, if variables have high covariance, they may be related.

Conclusion

This lecture introduced us to probabilistic modeling and Bayesian networks. Understanding joint distributions and inference allows us to compute complex relationships in data efficiently. Furthermore, covariance and independence are essential for developing robust probabilistic models.

Future Work

In upcoming lectures, we will explore rejection sampling and further applications of Bayesian networks.

General Inference and Rejection Sampling

Introduction

In today’s lecture, we cover important concepts related to probability inference, particularly focusing on an algorithm called Rejection Sampling. This technique is critically useful in probabilistic models, particularly Bayesian Networks (B-Nets).

Key Concepts

Bayesian Networks

A Bayesian Network is a graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). The notable features include:

Conditional Probability and Inference

Inference in a Bayesian Network involves calculating conditional probabilities that define the relationships among variables. Given random variables A and B:


$$P(A | B) = \frac{P(A, B)}{P(B)}$$

Computing P(A, B) typically requires marginalization over other variables, potentially yielding computationally intense calculations if many variables exist.

Covariance

Covariance quantifies how much two random variables change together. For random variables X and Y, it is defined as:


Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

The range of covariance is not bounded, which complicates interpretations.

Correlation

Correlation, a scaled version of covariance, is bounded between -1 and 1. It is computed as:


$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

Correlations should be interpreted carefully: zero correlation does not imply independence, since dependent variables can still have zero correlation.

Rejection Sampling

Rejection sampling is a Monte Carlo method for generating random samples from a distribution by using easy-to-sample distributions. The primary steps are:

  1. Generate a large number of samples from the joint distribution.

  2. Retain samples that conform to specific observed conditions.

The algorithm can be summarized as follows:
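The summary itself does not appear in these notes; a minimal Python sketch of the two steps, estimating P(Flu ∣ Fever) in a two-variable slice of the flu network with hypothetical probabilities:

```python
import random

# Rejection sampling sketch: sample from the joint, keep only
# samples matching the observation, then estimate the posterior.
random.seed(0)

def sample():
    f = random.random() < 0.1                    # P(F), hypothetical
    fe = random.random() < (0.9 if f else 0.05)  # P(Fe | F), hypothetical
    return f, fe

kept, flu_count = 0, 0
for _ in range(100_000):
    f, fe = sample()
    if fe:                  # keep only samples where Fever = 1
        kept += 1
        flu_count += f
print(flu_count / kept)     # ~0.67 = 0.09 / (0.09 + 0.045)
```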

Challenges in Rejection Sampling

If observations are rare, many samples will be rejected, leading to inadequate estimates of probability.

Sampling Continuous Variables

To sample continuous random variables, such as temperature (fever), the following process can be employed:

  1. Generate from a normal distribution based on the parent variable’s state (e.g., flu).

  2. Discretize the continuous variable to ensure matches during rejection sampling.

Applications and Future Directions

Understanding Bayesian inference and rejection sampling enables practical applications in various domains such as healthcare (e.g., WebMD) and education (e.g., student feedback). Advanced methods like Markov Chain Monte Carlo (MCMC) can further refine sampling methods by eliminating the rejection stage.

Conclusion

Rejection sampling serves as an efficient algorithm for inference in Bayesian Networks when conditions and relationships among variables are understood thoroughly. The applicability extends beyond academia to real-world problems where probabilistic modeling is essential.

Probabilities and Random Variables

The Concept of Random Variables

Examples of Probabilities

Choosing a YouTube Video

Weather Predictions

Bayesian Inference

Random Variable for Probability

Beta Distribution

Introduction to Beta Distribution

Mathematical Representation

Updating Beliefs

Conjugate Priors

Expectations and Modes

Applications in Multi-Armed Bandit Problems

Conclusion

Adding Random Variables

Introduction

Today, we will discuss the addition of random variables, a fundamental concept in probability theory with profound implications in various fields such as statistics, data science, and machine learning.

Story Background

An engaging story was shared about the history of the lecturer’s family, transitioning into themes of randomness and unpredictability, connecting it thematically to probability and the essence of random variables.

Motivation

Understanding how to add random variables is critical for solving various real-world problems, including zero-sum games and predicting behaviors in uncertain environments.

Key Definitions

Independent and Identically Distributed (IID)

A collection of random variables is said to be IID if the variables are mutually independent and all share the same distribution.

Discussion: Examples of IID

Consider the following cases:

  1. $X_i$ are independent exponential random variables with the same parameter $\lambda$.

  2. $X_i$ are independent exponential random variables with different parameters $\lambda_i$.

  3. $X_1 = X_2 = \ldots = X_n$ are not independent.

  4. $X_i$ are independent binomial random variables with different numbers of trials but the same probability of success.

Discussion with peers about these cases illustrates the nuances of independence and identical distribution.

Addition of Random Variables

Let’s denote two random variables X and Y. We wish to understand the properties of the sum Z = X + Y. The probability mass function for Z, assuming X and Y are independent, can be derived as follows:


$$P(Z = n) = \sum_{i=0}^{n} P(X = i) P(Y = n - i)$$

If X and Y are not independent, you would use the joint probability distribution:


$$P(Z = n) = \sum_{i=0}^{n} P(X = i, Y = n - i)$$
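A Python sketch (not from the lecture) of this convolution for independent variables, checked against the two-dice example:

```python
# Convolve two independent PMFs to get the PMF of Z = X + Y.
def add_pmfs(pmf_x, pmf_y):
    pmf_z = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            pmf_z[x + y] = pmf_z.get(x + y, 0.0) + px * py
    return pmf_z

die = {k: 1 / 6 for k in range(1, 7)}
two_dice = add_pmfs(die, die)
print(two_dice[7])  # 6/36 ~ 0.1667
```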

Special Properties

1. Sum of Binomials: If $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ are independent, then $X = X_1 + X_2 \sim \text{Binomial}(n_1 + n_2, p)$.
2. Sum of Normals: If $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then:
$$X = X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$
3. Sum of Poissons: If $X_1 \sim \text{Poisson}(\lambda_1)$ and $X_2 \sim \text{Poisson}(\lambda_2)$ are independent, then:
$$X = X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$$

Central Limit Theorem (CLT)

One of the most beautiful results in probability is the Central Limit Theorem, which states: if $X_1, X_2, \ldots, X_n$ are IID random variables with finite mean $\mu$ and finite variance $\sigma^2$, then:
$$Z_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1)$$
as n → ∞.

Thus, regardless of the initial distribution of the random variables, their sum will approximate a normal distribution given a sufficiently large n.

Applications and Examples

Conclusion

The lecture discussed multiple aspects of adding random variables, particularly focusing on the powerful insights provided by the Central Limit Theorem and its applications in real-world scenarios. The beauty and depth of this topic underline the importance of understanding probability and random processes.

Lecture Notes on Central Limit Theorem and Statistics

Introduction

The lecture addresses the importance of understanding intelligence in the context of decision making with a specific focus on modeling human uncertainty and making optimal decisions with limited information.

Central Limit Theorem (CLT)

Definition

The Central Limit Theorem states that the sum (or average) of a large number of independent and identically distributed (IID) random variables will be approximately normally distributed, regardless of the original distribution of the variables.

Mathematical Representation

Let $X_1, X_2, \ldots, X_n$ be IID random variables with mean $\mu$ and variance $\sigma^2$. The CLT can be formulated as:
$$S_n = X_1 + X_2 + \cdots + X_n \approx \mathcal{N}(n\mu, n\sigma^2)$$
As $n \to \infty$, the distribution of $S_n$ approaches a Normal distribution $\mathcal{N}(n\mu, n\sigma^2)$.

Application in Dice Rolls

For instance, consider rolling a fair six-sided die. The expected value and variance for a single die roll are:
$$\mu = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5, \quad \sigma^2 = \frac{(1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2}{6} = \frac{35}{12}$$
Thus, when summing n dice rolls:
$$\text{Mean} = n \cdot \mu = 3.5n, \quad \text{Variance} = n \cdot \sigma^2 = n \cdot \frac{35}{12}$$
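A Python simulation (not from the lecture) confirming these moments for sums of dice:

```python
import random

# Simulate sums of n = 30 dice; by the CLT the totals are
# approximately N(3.5 * 30, 30 * 35/12).
random.seed(0)
n, trials = 30, 10_000
totals = [sum(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]
mean = sum(totals) / trials
var = sum((t - mean)**2 for t in totals) / trials
print(mean, var)  # ~105 and ~87.5 (= 30 * 35/12)
```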

Estimation and Sampling

Sampling Distribution of the Mean

The sample mean calculated from a sample of n observations will have:
$$\bar{X} \sim \mathcal{N}(\mu, \sigma^2 / n)$$
This shows that the sampling distribution of the mean is also normally distributed, with mean $\mu$ and variance $\sigma^2 / n$.

Unbiased Estimation

It is critical to estimate population parameters from sample data. The sample mean is defined as:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
and it is an unbiased estimator of the population mean μ.

Estimating Variance

The sample variance $S^2$ is calculated as:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
The divisor n − 1 (Bessel’s correction) is used to obtain an unbiased estimator of the population variance $\sigma^2$.

Hypothesis Testing

When estimating parameters, we often use confidence intervals. A 95% confidence interval for the mean, using the standard error (SE) of the mean, is calculated as:
$$\bar{X} \pm Z_{\alpha/2} \cdot \text{SE} = \bar{X} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
where $Z_{\alpha/2}$ is the Z-value for the desired confidence level.

Conclusion

The CLT is a cornerstone of statistical theory that enables data analysts to make inferences about population parameters based on sample statistics. It is crucial to understand how to properly estimate these parameters and the limits of those estimations, especially concerning biases and sampling methods.

Bootstrapping and P-Values

Class Overview

During today’s lecture, we will cover several important concepts in probability and statistics: the Central Limit Theorem, estimating sample variance, the bootstrap method, and P-values.

Central Limit Theorem

The Central Limit Theorem (CLT) states that if you take a sufficiently large sample size n of independent and identically distributed (IID) random variables, the sample mean will be approximately normally distributed regardless of the shape of the original population distribution. Mathematically:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
where $\bar{X}$ is the sample mean, and the $X_i$ are the individual random variables.

The variance of the sample mean is given by:
$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
where $\sigma^2$ is the population variance.

Estimating Sample Variance

The sample variance $s^2$ can be estimated using the formula:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
The n − 1 term (Bessel’s correction) is used to obtain an unbiased estimator of the population variance.

Bootstrap Method

The bootstrapping method allows estimation of the distribution of a statistic (e.g., mean, variance) by resampling with replacement from the original sample. The steps are as follows:

  1. Take n samples from the population to create a sample of size n.

  2. Construct the empirical probability mass function (PMF) from the sample.

  3. Resample n observations with replacement from the constructed PMF.

  4. Calculate the desired statistic (e.g., sample variance) for each resample.

  5. Repeat the resampling process B times to generate a distribution of the statistic.
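A minimal Python sketch of these steps (not from the lecture), bootstrapping the sample variance of a hypothetical dataset:

```python
import random

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / (len(xs) - 1)

def bootstrap(sample, statistic, B=10_000):
    # Resample with replacement B times, computing the statistic
    # each time (steps 3-5 above).
    n = len(sample)
    return [statistic([random.choice(sample) for _ in range(n)])
            for _ in range(B)]

random.seed(0)
data = [2.1, 3.5, 4.0, 1.8, 2.9, 3.3]   # hypothetical sample
boot = bootstrap(data, sample_variance, B=5000)
print(min(boot), max(boot))              # spread of the variance estimate
```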

P-Values

The P-value is a measure of the strength of the evidence against the null hypothesis. It quantifies how likely it would be to observe data as extreme as what we did, given that the null hypothesis is true.

The P-value can be calculated as follows:

  1. Define the null hypothesis (H0) and alternative hypothesis (H1).

  2. Calculate the observed statistic (e.g., difference in means).

  3. Use bootstrapping to generate a distribution of the statistic under H0.

  4. Determine the proportion of bootstrapped statistics that are more extreme than the observed statistic:
    $$P\text{-value} = \frac{\text{Number of extreme statistics}}{B}$$

Example Application

For example, to test whether there is a difference in means between two groups:

  1. Combine the two groups into a single “universal” distribution under the null hypothesis.

  2. Resample two groups with replacement from this combined distribution.

  3. Calculate the difference in means for these resamples.

  4. Accumulate a distribution of differences.

  5. Compute the P-value based on how many differences are greater than or equal to the observed difference.


$$\begin{aligned} D_{mean} &= \bar{X}_1 - \bar{X}_2 \\ P\text{-value} &= P(D_{sample} \geq D_{observed} \mid H_0 \text{ is true})\end{aligned}$$
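A Python sketch of this procedure (not from the lecture), with hypothetical group data:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_p_value(group1, group2, B=10_000):
    # Difference-in-means P-value under H0: both groups come from
    # the same "universal" distribution (the steps above).
    observed = abs(mean(group1) - mean(group2))
    universal = group1 + group2
    extreme = 0
    for _ in range(B):
        g1 = [random.choice(universal) for _ in range(len(group1))]
        g2 = [random.choice(universal) for _ in range(len(group2))]
        if abs(mean(g1) - mean(g2)) >= observed:
            extreme += 1
    return extreme / B

random.seed(0)
print(bootstrap_p_value([5.1, 4.9, 5.6, 5.8], [4.2, 4.0, 4.5, 4.1]))
```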

Conclusion

Understanding bootstrapping and P-values provides robust tools for statistical inference, allowing researchers to express confidence in their reported statistics. Bootstrapping allows for flexible analyses without relying on strong parametric assumptions.

Expectation and Algorithmic Analysis

Introduction

This lecture focuses on algorithmic analysis, particularly dealing with expectation, law of total expectation, and their implications in algorithm design and analysis. The lecture included various problems, one of which was difficult, as indicated by the instructor.

Expectation

Definition of Expected Value

The expected value (or mean) of a random variable X can be calculated as follows:


$$E[X] = \sum_{x} x \cdot P(X = x)$$

where P(X = x) is the probability that the random variable X takes on the value x.

Linearity of Expectation

One essential property of expectation is linearity. If we have several random variables X1, X2, …, Xn, the expectation of their sum is the sum of their expectations:


$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$

This property holds regardless of whether the random variables are independent.

Limitation of Expectation

A limitation of the expected value is that it summarizes a probability distribution into a single number, which may overlook important characteristics of the distribution, such as variance and tail behavior.

Law of Unconscious Statistician

The expected value of a function g(X) of a random variable X can be computed using the following formula:


$$E[g(X)] = \sum_{x} g(x) \cdot P(X = x)$$

This is useful in calculating expectations of functions of random variables.

Indicator Random Variables

An indicator random variable I for an event E is defined as:


$$I = \begin{cases} 1, & \text{if event } E \text{ occurs} \\ 0, & \text{if event } E \text{ does not occur} \end{cases}$$

The expected value of an indicator random variable is equal to the probability of the event:


E[I] = P(E)

Binomial Distribution

A binomial random variable Y can be defined as the sum of n independent Bernoulli trials (indicator random variables):


$$Y = \sum_{i=1}^{n} I_i$$

where each Ii represents a success (1) or failure (0) in a Bernoulli trial with success probability p. The expected value of Y is:


E[Y] = n ⋅ p

Negative Binomial Distribution

The negative binomial distribution counts the number of trials until the rth success. Its expected value is given by:


$$E[Y] = r \cdot \frac{1}{p}$$

where p is the probability of success in each trial.

Computer Cluster Utilization Problem

Consider a computer cluster with K servers and independent request probabilities pi for each server. Let Ai be an indicator variable that represents whether server i is idle. The total number of idle servers X can be expressed as:


$$X = \sum_{i=1}^{K} A_i$$

To find the expected number of servers not idle, Y:


Y = K − X  ⇒  E[Y] = K − E[X]

Furthermore, since $A_i$ is the indicator that server $i$ is idle, we can compute $E[A_i]$ as $1 - p_i$.

Coupon Collector’s Problem

In the classic "coupon collector’s problem," let X be the number of trials until each of n buckets receives at least one coupon. The expected value of X is given by:


$$E[X] = n \cdot \sum_{k=1}^{n} \frac{1}{k} \approx n \cdot \ln(n) + \gamma n$$

Here, γ is the Euler-Mascheroni constant.
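A quick Python check (not from the lecture) of the exact harmonic-sum form against the approximation:

```python
from math import log

def coupon_collector_expectation(n):
    # E[X] = n * H_n, the exact harmonic-sum form above.
    return n * sum(1 / k for k in range(1, n + 1))

print(coupon_collector_expectation(10))   # ~29.29
print(10 * log(10) + 0.5772156649 * 10)   # ~28.80, the approximation
```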

Differential Privacy

Differential privacy is a method of ensuring privacy when dealing with sensitive datasets. The fundamental approach involves adding random noise to the outputs of queries on the dataset. The expected output under differential privacy can be analyzed with the previously discussed laws of expectation.

Conclusion

The lecture covered a wide range of topics related to expectation and algorithmic analysis, with applications in cloud computing and data privacy. The concepts of expected values, indicator random variables, and the linearity of expectation are foundational in algorithmic design.

Introduction to Machine Learning

Introduction

Machine learning is a vital part of artificial intelligence that focuses on learning from data. This lecture explores the foundational concepts of parameter estimation, particularly Maximum Likelihood Estimation (MLE) and its applications in building machine learning models.

Parametric Models

In probabilistic models, parameters define the characteristics of the distribution:

Stages of Machine Learning

  1. Modeling: Define a probabilistic model for the real-world problem.

  2. Training: Estimate parameters using available training data.

  3. Prediction: Use the model with learned parameters for predictions on new data.

Maximum Likelihood Estimation (MLE)

MLE is a method for estimating the parameters of a statistical model. It defines the likelihood of observing the given data under specific parameter values.

Likelihood Function

For independent and identically distributed (IID) data points, the likelihood function L(θ) is given by:
$$L(\theta) = \prod_{i=1}^{n} P(X_i | \theta)$$
where P(Xi|θ) is the probability of observing data point Xi given parameter θ.

Log-Likelihood Function

Taking the logarithm simplifies calculations:
$$\log L(\theta) = \sum_{i=1}^{n} \log P(X_i | \theta)$$

Finding MLE

To find the MLE, we maximize the log-likelihood function:
$$\hat{\theta} = \arg\max_{\theta} \log L(\theta)$$
This can involve solving:
$$\frac{d}{d\theta} \log L(\theta) = 0$$

Example: Estimating Parameters for Poisson Distribution

Likelihood for Poisson

The probability mass function for a Poisson distribution is:
$$P(X = k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Thus, the log-likelihood is:
$$\log L(\lambda) = \sum_{i=1}^{n} \left( X_i \log \lambda - \lambda - \log(X_i!) \right)$$

For n independent samples, the maximum likelihood estimator for λ is:
$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Example: Estimating Parameters for Bernoulli Distribution

The probability mass function for a Bernoulli distribution is:
$$P(X = x \mid p) = p^x (1-p)^{1-x}$$

The log-likelihood becomes:
$$\log L(p) = \sum_{i=1}^{n} \left( X_i \log p + (1 - X_i) \log(1 - p) \right)$$

To find MLE for p:
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

General Insights about MLE

Applications of MLE

  1. Identifying parameters in distributions such as Gaussian and Bernoulli.

  2. Generalizing across distributions using algorithms like Expectation-Maximization (EM) for models that include latent variables.

Conclusion

Understanding Maximum Likelihood Estimation provides the mathematical background necessary for various machine learning techniques and helps in building more sophisticated models.

Introduction to Machine Learning

Overview

In this class, we are transitioning from pure probability theory to the fundamentals of machine learning. Our objective is to understand parameter estimation as the core challenge of machine learning, and how this relates to deep learning. The main focus will be on estimating parameters from probabilistic models based on given data.

Parameter Estimation

Parameter estimation is the process of inferring the values of parameters (Θ) in a probabilistic model based on observed data.

Goals:

  1. Model real-world problems using probabilistic models.

  2. Estimate parameters using historical data.

To denote the parameters, we will use the symbol Θ (theta).

Maximum Likelihood Estimation (MLE)

The first significant method for estimating parameters is Maximum Likelihood Estimation (MLE). The idea is to choose parameters that maximize the likelihood of observing the given data.

Likelihood

Given a dataset {x1, x2, …, xn}, the likelihood L(Θ) can be written as:
$$L(\Theta) = P(X|\Theta) = \prod_{i=1}^{n} P(x_i|\Theta)$$

Where P(xi|Θ) is the probability of observing each data point given the parameters Θ.

Log-Likelihood

To simplify calculations, we often work with the log-likelihood, denoted as:
$$\log L(\Theta) = \sum_{i=1}^{n} \log P(x_i|\Theta)$$
Maximizing the log-likelihood is computationally easier than maximizing the likelihood function directly.

Finding the Maximum

To find the maximum, we take the derivative of the log-likelihood with respect to Θ:
$$\frac{d}{d\Theta} \log L(\Theta) = 0$$
The critical points can be found by setting the above derivative to zero.

Challenges with MLE

MLE can lead to overfitting, where the estimated parameters define the model too closely based on the training data, potentially failing on new data.

Gradient Ascent

When maximizing functions, a traditional approach may not always yield an optimal solution efficiently. Gradient ascent provides a method to incrementally reach the maximum:

  1. Start with an initial guess Θ0.

  2. Compute the gradient g of the log-likelihood at that point.

  3. Update the parameters:


$$\Theta_{i+1} = \Theta_i + \alpha \cdot g$$
Where $\alpha$ is the step size.
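A minimal Python sketch of gradient ascent (not from the lecture) on the Bernoulli log-likelihood, where the closed-form MLE (the sample mean) gives a known target; the data and step size are hypothetical:

```python
# Gradient ascent on the Bernoulli log-likelihood.
data = [1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical coin-flip data
theta, alpha = 0.5, 0.01
for _ in range(2000):
    # d/d(theta) of sum[x log t + (1 - x) log(1 - t)]
    grad = sum(x / theta - (1 - x) / (1 - theta) for x in data)
    theta += alpha * grad
print(theta)  # converges toward sum(data)/len(data) = 0.75
```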

Parameter Estimation with Gradient Ascent

The gradient ascent method will lead to the highest point (maximum) in the log-likelihood function. However, if the steps are too large, it might overshoot the maximum.

Bayesian Estimation

Unlike MLE, Bayesian estimation incorporates prior beliefs about the parameters through Maximum A Posteriori (MAP) estimation.

Bayes’ Theorem

To estimate the parameters using Bayes’ theorem, we focus on:
$$P(\Theta|X) = \frac{P(X|\Theta)P(\Theta)}{P(X)}$$
Where:

  1. P(Θ|X) is the posterior.

  2. P(X|Θ) is the likelihood.

  3. P(Θ) is the prior belief.

Maximum A Posteriori (MAP)

To estimate parameters using MAP:
$$\Theta_{\text{MAP}} = \arg\max_{\Theta} P(\Theta|X) = \arg\max_{\Theta} P(X|\Theta)\,P(\Theta)$$

This is akin to performing MLE while also accounting for the prior:
$$\log P(\Theta|X) = \log P(X|\Theta) + \log P(\Theta) - \log P(X)$$
Since log P(X) does not depend on Θ, maximizing the posterior amounts to maximizing the log-likelihood plus the log-prior.

MAP estimation helps mitigate the overfitting issues associated with MLE by incorporating prior distributions for Θ.

Conjugate Priors

A prior is called a conjugate prior for a likelihood when the resulting posterior belongs to the same family of distributions as the prior. Some common conjugate pairs include:

  1. Beta prior with a Bernoulli (or binomial) likelihood.

  2. Dirichlet prior with a multinomial likelihood.

  3. Gamma prior with a Poisson likelihood.

Example: Estimating Proportions with Bernoulli Trials

Given Bernoulli trial data consisting of successes and failures, to estimate the probability p we use:
$$P(p \mid \text{data}) \propto P(\text{data} \mid p)\, P(p)$$
where P(p) can be a Beta prior, resulting in a posterior that is also a Beta distribution.
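For example, with a Beta(a, b) prior and s successes in n trials, conjugacy gives the posterior Beta(a + s, b + n − s) in closed form. A minimal sketch (the hyperparameters a and b below are illustrative):

```python
# Illustrative Beta(2, 2) prior with 7 successes in 10 Bernoulli trials
a, b = 2, 2
n, s = 10, 7

# Conjugacy: posterior is Beta(a + s, b + n - s)
a_post, b_post = a + s, b + n - s

p_map = (a_post - 1) / (a_post + b_post - 2)  # mode of the Beta posterior
p_mle = s / n                                 # MLE, for comparison
print(p_map, p_mle)  # the MAP estimate is pulled toward the prior mean 0.5
```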

Implementation

To compute MAP estimates:

  1. Specify a prior.

  2. Collect data and calculate the likelihood.

  3. Optimize the log-likelihood plus the log of the prior, using gradient ascent or another optimization technique (see the sketch below).
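Concretely, for the Beta-Bernoulli example above, step 3 just adds the prior's gradient to the likelihood's gradient (a sketch under the same illustrative hyperparameters):

```python
a, b = 2, 2    # illustrative Beta prior hyperparameters
n, s = 10, 7   # observed data: 7 successes in 10 trials

p, alpha = 0.5, 1e-3
for _ in range(10000):
    grad_lik = s / p - (n - s) / (1 - p)          # d/dp log P(data | p)
    grad_prior = (a - 1) / p - (b - 1) / (1 - p)  # d/dp log Beta(p; a, b)
    p += alpha * (grad_lik + grad_prior)          # ascend the log posterior
    p = min(max(p, 1e-6), 1 - 1e-6)

print(p)  # ~0.667, matching the closed-form MAP estimate above
```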

Conclusion

The transition from MLE to MAP and the introduction of priors significantly enhances our ability to make robust parameter estimates that generalize better to new data. This will be foundational as we explore more complex machine learning algorithms in the upcoming classes.

Naive Bayes Classification Notes

Introduction

Naive Bayes is the first machine learning algorithm explored in the course and is particularly useful for classification tasks.

Conceptual Framework

Machine Learning Overview

Machine learning can be visualized as employing probabilistic models with learned parameters to make predictions based on input data.

Parameter Estimation

The two main approaches for parameter estimation are Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

Notation and Variables

Inputs and outputs in classification tasks are denoted X (a vector of input features) and Y (the output class label), respectively.

Naive Bayes Classifier

Assumptions

The naive Bayes classifier is based on the strong (hence "naive") assumption that the inputs are conditionally independent given the output class:
$$P(X|Y) = P(X_1|Y) \cdot P(X_2|Y) \cdots P(X_m|Y)$$
where X = (X1, X2, …, Xm) are the input features.

Bayes’ Theorem

The prediction formula for the naive Bayes classifier can be expressed as:
P(Y|X) ∝ P(Y) ⋅ P(X|Y)
After applying the naive Bayes assumption, we focus on the numerator:
$$P(Y | X) \propto P(Y) \cdot \prod_{i=1}^{m}P(X_i | Y)$$

Parameter Estimation

Parameter estimation for naive Bayes involves estimating the prior P(Y) and the conditional probabilities P(Xi|Y) from training data, typically via MLE (frequency counts) or MAP (e.g., with Laplace smoothing).

Prediction

To make a prediction for a new data point, compute P(Y = y) ⋅ ∏i P(Xi|Y = y) for each class y and predict the class with the largest value, as in the sketch below.
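A minimal sketch of a naive Bayes classifier over binary features (the toy data and the Laplace-smoothing choice are illustrative):

```python
import numpy as np

# Toy binary features (e.g., word presence) and class labels; illustrative only
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def fit_naive_bayes(X, y):
    """Estimate P(Y) and P(X_i = 1 | Y) with Laplace (add-one) smoothing."""
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
    return priors, cond

def predict(x, priors, cond):
    """Pick the class maximizing log P(Y) + sum_i log P(x_i | Y)."""
    scores = {}
    for c in priors:
        log_p = np.log(priors[c])
        log_p += np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
        scores[c] = log_p
    return max(scores, key=scores.get)

priors, cond = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 0]), priors, cond))  # prints 1
```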

Applications

A notable application of naive Bayes is email spam detection. In this scenario, Y indicates whether an email is spam, and each feature Xi indicates whether a particular word appears in the email.

Performance Metrics

To evaluate the classifier’s performance, standard metrics include accuracy, precision, and recall.

Conclusion

Naive Bayes is a foundational algorithm for classification that simplifies complex probability calculations through its naive independence assumption. It forms the basis for understanding more complex machine learning methodologies and remains relevant due to its efficiency and effectiveness, particularly in situations with high-dimensional data, such as text classification.

Machine Learning and Logistic Regression

Introduction

Core Concepts

Classification Algorithms

Modeling Process

Parameter Estimation

Maximum Likelihood Estimation (MLE)

Logistic Regression

Logistic Model

Sigmoid Function

Model Formulation

Training the Logistic Regression Model

Cost Function

Optimization via Gradient Ascent

Predictions

Conclusion

Deep Learning Notes

Introduction

Deep learning represents a subset of machine learning focused on algorithms that learn from data in a hierarchical manner, inspired by the structure of the human brain.

Learning Objectives

Core Concepts

Logistic Regression

Logistic regression is a fundamental algorithm in supervised classification.

Neural Networks

Forward Pass

The forward pass through a neural network involves computing each layer’s activations from the previous layer’s outputs, ending with the prediction:

$$\hat{y} = \sigma(WH + b)$$

where H is the hidden-layer activation vector, W the output-layer weights, and b the bias.
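A minimal NumPy sketch of one forward pass for a tiny network (the weights below are random placeholders, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(seed=8)
x = rng.normal(size=4)         # input features
W1 = rng.normal(size=(3, 4))   # hidden-layer weights (3 hidden units)
b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3))   # output-layer weights W
b2 = np.zeros(1)

H = sigmoid(W1 @ x + b1)       # hidden activations H
y_hat = sigmoid(W2 @ H + b2)   # prediction: sigma(W H + b)
print(y_hat)
```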

Training and Backpropagation

To train a neural network, we need to optimize the weights through a process known as backpropagation, which relies on:

Chain Rule and Gradients

In backpropagation:

Applications of Deep Learning

Conclusion

Deep learning enables us to create powerful predictive models by stacking layers of neurons (logistic regressions) and training them using large datasets. It is becoming ubiquitous in applications ranging from healthcare to art creation. Understanding the foundations, such as logistic regression and how neural networks operate, is key to advancing in this field.

Questions for Further Study

To delve deeper into deep learning:

Ethics and Machine Learning: Detailed Notes

Introduction

Today, we are diving into the important topic of ethics in machine learning. Understanding how machine learning works is crucial for addressing the impacts and implications of increasingly capable algorithms on society.

Personal Ethics and Human Goodness

AI and Its Impact on Society

AI influences multiple aspects of our daily lives, such as:

The dual implications of AI:

  1. It holds great potential for societal benefits (e.g., better healthcare, intelligent electricity grids, access to quality education).

  2. It can also manifest in wayward applications that exhibit biases, calling for ethical considerations.

Learning Goals

By the end of this lecture, you should:

- Understand the limitations of “fairness through unawareness.”

- Recognize two frameworks for measuring fairness: parity and calibration.

- Be familiar with techniques to mitigate fairness issues.

Key Concepts in Fairness

1. Protected Demographic:
- Groups that are safeguarded against discrimination based on characteristics such as race, age, gender, etc.

2. Distributive vs. Quality of Service Harm:
- Quality of Service Harm: AI performs better for certain demographic groups than others (e.g., inaccurate facial recognition).
- Distributive Harm: AI impacts critical opportunities (job applications, loans) unfairly based on demographics, often resulting in more severe consequences.

3. Procedural vs. Distributive Fairness:
- Procedural fairness focuses on the fairness of the process.
- Distributive fairness emphasizes fairness in the outcomes.

Machine Learning Basics

Classification algorithms covered so far include:

- Naive Bayes
- Logistic Regression

Using logistic regression:

- Logistic regression predicts outcomes using a sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
where, if z > 0, we predict 1; if z < 0, we predict 0.

- It creates a straight line (hyperplane) separating the classes, which limits its effectiveness on non-linearly separable data.

Fairness Definitions

  1. Fairness through Unawareness: Excluding demographic features to make decisions.
    Example: Not considering race when making decisions.

  2. Fairness through Awareness: Taking demographics into account for decisions but ensuring equality in outcomes.

Parity: The probability of a positive prediction should be the same across demographics:
$$P(\hat{Y} = 1 \mid D = d_0) = P(\hat{Y} = 1 \mid D = d_1)$$
Calibration: The prediction accuracy should be equal across demographics:
$$P(\hat{Y} = Y \mid D = d) = P(\hat{Y} = Y)$$
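A minimal sketch of checking parity from model predictions (the arrays below are illustrative placeholders):

```python
import numpy as np

# Illustrative placeholders: predictions and a demographic label per person
y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
d     = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Parity: P(Y_hat = 1 | D = d) should be (roughly) equal across groups
for group in np.unique(d):
    rate = y_hat[d == group].mean()
    print(f"P(y_hat = 1 | D = {group}) = {rate:.2f}")
```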

Case Studies of Harm in AI

  1. Quality of Service Harm: Cameras whose autofocus performs poorly for specific demographic groups.
    Caused by training data biases.

  2. Distributive Harm Case Study: An algorithm used for college admissions that inadvertently favored demographic characteristics, hindering female and non-Caucasian applicants.

Techniques for Mitigating Fairness Issues

Ethics in Technology

Final Thoughts

Conclusion

Please take time to reflect on the ethical implications of your work and the constructs of fairness in machine learning. Your awareness and consideration can shape the future of technology harmoniously.

Generative Models and Neural Networks

Generative Models Overview

Generative models learn to generate data similar to the training set, allowing for applications such as text synthesis and image generation.

Models Discussed

  1. DALL-E: Generates images from textual descriptions.

  2. GPT-3: Generates text based on given prompts.

Key Concepts

Image Generation with DALL-E

DALL-E uses a neural network to create images from textual descriptions, leveraging the concept of denoising.

Denoising Process

The key idea is to start with a heavily noised image and progressively remove noise through a neural network.

Adding Gaussian Noise

To manipulate images, Gaussian noise is added:
$$\text{Noisy Image} = \text{Original Image} + \mathcal{N}(0, \sigma^2)$$
where 𝒩(0, σ²) is Gaussian noise with mean 0 and variance σ².
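In code, this is a single elementwise addition (a sketch with a random array standing in for a real image):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
image = rng.random((64, 64))   # stand-in for a grayscale image in [0, 1]

sigma = 0.1
noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)  # N(0, sigma^2)
noisy_image = image + noise
```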

Training the Neural Network

The neural network is trained to minimize the difference between the predicted noise and the actual noise, measured as the sum of squared errors:
$$\text{Loss}(\theta) = \sum_{i} \left( y_i - \hat{y}_i \right)^2$$
where y_i is the true noise and ŷ_i is the predicted noise.

Expected Output

The process is repeated several times to generate an image that resembles the target:

  1. Start with a completely noisy image.

  2. Apply the trained denoising network multiple times.

  3. Each iteration reduces the noise (a sketch of this loop follows).
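A sketch of this loop, with a trivial stand-in for the trained denoising network (the real model and noise schedule are far more involved):

```python
import numpy as np

def predict_noise(x):
    # Hypothetical stand-in for the trained denoising network
    return 0.1 * x

rng = np.random.default_rng(seed=4)
x = rng.normal(size=(64, 64))   # step 1: start from pure Gaussian noise

for _ in range(50):             # steps 2-3: repeatedly remove predicted noise
    x = x - predict_noise(x)
```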

Text Generation with GPT-3

GPT-3 uses a different but related approach to generate coherent and contextually relevant text.

Attention Mechanism

The attention mechanism allows the model to weigh the importance of different words in the context:


$$\text{Attention}(Q, K, V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where Q (queries), K (keys), and V (values) are matrices derived from input words, and dk is the dimensionality of the keys.
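A minimal NumPy sketch of this formula (single head, no masking; the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax; subtracting the max improves numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(seed=5)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
print(attention(Q, K, V).shape)  # (4, 8)
```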

Generating Text

The text generation process can be broken down into:

1. Modeling the probability of the next word given the previous words:
$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

2. Selecting the next word based on the probability distribution obtained:
$$w_t \sim P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

Sampling from the predicted probability distribution, rather than always choosing the most likely word, helps maintain variability in the output.
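A minimal sketch of that sampling step (the vocabulary and probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=6)

vocab = ["the", "cat", "sat", "mat"]
# Illustrative model output: P(w_t | w_1, ..., w_{t-1}) over the vocabulary
probs = np.array([0.5, 0.2, 0.2, 0.1])

# Sample rather than always taking the argmax, preserving variability
next_word = rng.choice(vocab, p=probs)
print(next_word)
```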

Training the Model

Like DALL-E, GPT-3 is trained using MLE, maximizing the probability the model assigns to the actual next word at each position in the training set.

Conclusion

In today’s lecture, we explored the foundational ideas behind two significant generative models: DALL-E and GPT-3.

Final Thoughts

Generative models are shaping the future of AI with applications spanning multiple domains. Our understanding of these models will play a pivotal role in developing future technologies.

Final Summary

Course Overview

Ensure you have access to practice final exams and resources. During the course, we covered several topics:

Key Takeaways

Research Insights

Reflection on the potential in research:

Problems and Vignettes

A vignette was shared about using similarity distance to select representative assignments from a pool of 300 solutions. The goal is to minimize the distance between selected and unselected solutions.

Algorithmic Improvement

Thompson Sampling

Thompson Sampling is a probabilistic algorithm that can reduce the complexity of this selection problem from O(n²) to O(n log n) by using sampling to evaluate uncertainty.


P(A|B) ∝ P(B|A)P(A)

The algorithm enables effective exploration of the best candidates by maintaining belief distributions over possible outcomes.
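A minimal sketch of Thompson Sampling with Beta beliefs over three candidates (the "true" qualities exist only to simulate feedback; this is an illustration, not the lecture's exact setup):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
true_quality = np.array([0.3, 0.6, 0.5])  # hidden; used only to simulate feedback

# Beta(a, b) belief over each candidate's quality, starting uniform
a = np.ones(3)
b = np.ones(3)

for _ in range(1000):
    samples = rng.beta(a, b)                  # one draw from each belief
    i = int(np.argmax(samples))               # try the most promising candidate
    reward = rng.random() < true_quality[i]   # simulated noisy evaluation
    a[i] += reward                            # Bayesian update of the belief
    b[i] += 1 - reward

print(np.argmax(a / (a + b)))  # belief concentrates on the best candidate
```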

Quality of Education

A major theme discussed was the gap in quality education globally. The drive towards improving educational quality can leverage the increasing access to technology (e.g., smartphones) and rich datasets from platforms like Code.org.

Significant Challenges in Feedback and Grading

A focus was given to the difficulties in providing feedback, especially in coding assignments:

Feedback Processes

The idea of collecting a vast amount of labeled data to inform feedback algorithms can significantly impact students’ learning trajectories. Methods explored include:

Bayesian Approach to Learning

Bayesian Models allow for:

Probabilistic Graphical Models

The course introduced probabilistic graphical models for representing collections of random variables and performing inference over their relationships.

Given two random variables X and Y:
P(X, Y) = P(X|Y)P(Y)

Exploring the relationships between variables can lead to insights in clustering student submissions, reducing redundancy in sample evaluations, and enabling more personalized feedback.

Final Thoughts on Future Directions

Consider the following classes that follow CS 109:

Intersectionality in Research

Encouragement to find unique problem intersections based on personal lived experiences, interests, and academic backgrounds.

Conclusion

The lecture concluded with gratitude expressed towards the students for their engagement throughout the course. It emphasized the importance of continuous exploration in both academic and personal paths.

“You live in interesting times — a time full of potential discoveries and opportunities.”

Final Review Notes

Reading Probability Questions

Beta Distribution

Adding Independent Random Variables

Central Limit Theorem

Sampling, Sample Mean, and Sample Variance

Parameter Estimation

Maximum Likelihood Estimation (MLE)

Bayesian Estimation (MAP)

Bootstrapping

Naive Bayes

Logistic Regression

Common Questions and Problems

  1. For problems involving multiple parameters, derive partial derivatives for each parameter.

  2. Expect questions about applying MLE and MAP in practical scenarios, including parameter estimation and hypothesis testing.

Conclusion