Counting provides the foundation for understanding probability and is essential for solving complex problems in computer science and AI.
An experiment leads to various outcomes.
Example: Rolling a die has outcomes: 1, 2, 3, 4, 5, 6.
If an experiment consists of multiple steps, the total number of outcomes is the product of outcomes at each step.
N = n1 × n2 × … × nk
where ni is the number of outcomes in step i.
Example: Rolling two dice gives 6 × 6 = 36 total outcomes.
If outcomes can come from two or more events (sets), the total number of outcomes equals the sum of outcomes from each event minus any overlaps.
N = |A| + |B| − |A ∩ B|
where |A| = size of set A, |B| = size of set B.
How many unique arrangements of the letters "boba"?
The challenge: how do we count arrangements when letters repeat, given that swapping identical letters does not produce a new arrangement?
General formula for unique arrangements of a multiset:
$$N = \frac{n!}{n_1! \times n_2! \times \ldots \times n_k!}$$
where n is the total number of items and ni is the count of each unique item.
$$\text{Total Letters} = 4 \quad (b, o, b, a) \\
\text{Arrangements} = \frac{4!}{2!} = \frac{24}{2} = 12$$
Understanding counting is essential for grasping probability.
Concepts covered include basic counting principles, arrangement calculations, and real-world applications in AI and algorithms.
Remember to participate, practice, and review material regularly.
In this lecture, we focus on the essential rules of counting, which serve as the foundation for probability theory and combinatorial mathematics. Key concepts include permutations, combinations, and methods for counting arrangements of indistinguishable objects.
The two core rules of counting are:
Multiplication Rule: If an experiment can be broken down into a series of steps where the number of outcomes at each step does not depend on previous outcomes, the total number of outcomes is the product of the number of outcomes at each step.
n = n1 × n2 × … × nk
where n is the total number of outcomes, and ni are the number of outcomes for each step.
Addition Rule: If an experiment can result in outcomes from two or more distinct groups, the total number of outcomes is the sum of the outcomes from each group, minus any overlaps.
$$n = n_A + n_B - n_{A \cap B}$$
A permutation refers to an ordered arrangement of objects.
The number of permutations of n distinct objects is given by:
P(n) = n!
For example, for the letters in "CHRIS":
P(5) = 5! = 120
If some objects are indistinct, the formula for permutations changes to account for the indistinguishable items.
The general formula for n objects where there are groups of indistinguishable objects is:
$$P(n; n_1, n_2, \ldots, n_k) = \frac{n!}{n_1! \cdot n_2! \cdot \ldots \cdot n_k!}$$
For example, for the word "MISSISSIPPI":
$$P = \frac{11!}{4! \cdot 4! \cdot 2!}$$
where there are four ’I’s, four ’S’s, and two ’P’s.
A combination refers to the selection of objects without regard to the order.
If we want to choose k objects from n distinct objects, we utilize the combinations formula:
$$C(n, k) = \frac{n!}{k!(n-k)!}$$
For example, to choose 3 books from 6:
$$C(6, 3) = \frac{6!}{3! \cdot 3!} = 20$$
When the objects being selected are indistinguishable, the idea is similar, but each unique grouping is counted exactly once rather than once per ordering. Problems such as distributing indistinguishable items among groups therefore need a separate counting argument, described next.
If you have n indistinct objects to put into r distinct buckets, the formula is given by:
$$\text{Ways} = \frac{(n + r - 1)!}{n! \cdot (r - 1)!}$$
This is derived from the method of "stars and bars" or dividers.
To place 4 indistinguishable objects into 3 distinct buckets:
$$\text{Ways} = \frac{(4 + 3 - 1)!}{4! \cdot (3 - 1)!} = \frac{6!}{4! \cdot 2!} = 15$$
Understanding these foundational counting principles is crucial for tackling more complex problems in probability and combinatorics. Future classes will build on these concepts, particularly as we apply them to probability theory.
This course marks the beginning of our journey into probability theory, specifically tailored for computer scientists. Get ready to delve deeply into the world of probabilities!
In prior classes, we covered advanced counting techniques:
Permutations: The number of ways to arrange n distinct items:
P(n) = n!
Combinations: The number of ways to choose k items from n:
$$C(n, k) = \frac{n!}{k!(n-k)!}$$
Distributions (Bucketing): Number of ways to distribute n distinct items into k buckets.
The sample space is the set of all possible outcomes of an experiment.
Example: Flipping a coin yields S = {Heads, Tails}
Size of sample space can be finite or infinite.
An event is a subset of the sample space that satisfies a particular condition.
Example: If S = {1, 2, 3, 4, 5, 6} (a die roll), then an event for rolling a number less than 4 would be E = {1, 2, 3}.
The probability of an event E occurring is denoted P(E) and is defined as:
$$P(E) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} \quad \text{(if equally likely)}$$
Where P(E) must satisfy:
0 ≤ P(E) ≤ 1
Non-negativity: P(E) ≥ 0
Normalization: P(S) = 1
Additivity: For mutually exclusive events A and B,
P(A ∪ B) = P(A) + P(B)
When all outcomes in the sample space are equally likely, the probability of an event E is given by:
$$P(E) = \frac{|E|}{|S|}$$
where |E| is the number of outcomes in event E and |S| is the number of outcomes in the sample space.
Example: when two dice are rolled,
$$P(\text{sum} = 7) = \frac{6}{36} = \frac{1}{6}$$
Here, we count the pairs: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1).
Consider 4 cows and 3 pigs from which 3 animals are drawn at random; we want to find the probability of drawing 1 cow and 2 pigs.
Sample Space: C(7, 3)
Event Space: Choose 1 cow from 4, and choose 2 pigs from 3:
$$P(\text{1 cow, 2 pigs}) = \frac{C(4, 1) \times C(3, 2)}{C(7, 3)} = \frac{4 \times 3}{35} = \frac{12}{35}$$
To calculate the probability of drawing a straight hand from a deck:
Sample Space: C(52, 5)
Event Space Count: Count valid straights: there are 10 choices of starting card value (e.g., (2, 3, 4, 5, 6)) and $4^5$ choices of suits for the five cards:
$$P(\text{Straight}) = \frac{10 \cdot 4^5}{C(52, 5)}$$
With N chips, one defective, and K chosen:
$$P(\text{defective in sample}) = \frac{K}{N}$$
Probability theory is a profound language that helps us quantify uncertainty and understand the randomness inherent in our world.
Last week focused on combinatorics and the foundations of probability. We covered:
Counting orderings of objects
Calculating combinations of objects
Permutations can produce extremely large numbers. For example, arranging 52 unique cards yields 52! (52 factorial) orderings, so many that any particular well-shuffled arrangement has almost certainly never occurred before.
We discussed the three fundamental axioms of probability:
P(E) ≥ 0 for any event E
P(S) = 1 where S is the sample space
For any two mutually exclusive events E and F:
P(E ∪ F) = P(E) + P(F)
What this means is that if no outcomes overlap between E and F, the probability of E or F happening can be calculated by summing their probabilities.
A key concept in probability is conditional probability, denoted as:
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
where P(E ∣ F) is the probability of event E given F has occurred.
If two events E and F are mutually exclusive, then:
P(E ∪ F) = P(E) + P(F)
We can apply these concepts to practical situations, illustrating through examples such as card drawing.
The law of total probability states:
$$P(E) = P(E \mid F)P(F) + P(E \mid F^c)P(F^c)$$
This allows for assessing P(E) based on conditional probabilities related to the disjoint partitioning events F and $F^c$.
Once we establish a conditional probability, we can explore probabilities in reverse using Bayes’ Theorem:
$$P(F \mid E) = \frac{P(E \mid F) P(F)}{P(E)}$$
This allows us to compute the probability of an event given new evidence.
Consider a medical test:
The probability of having the disease: P(D)
The probability of testing positive given you have the disease: P(T ∣ D)
The probability of testing positive given you do not have the disease: $P(T \mid D^c)$
An example using these probabilities illustrates practical usage:
$$P(D \mid T) = \frac{P(T \mid D) P(D)}{P(T)}$$
When we observe an event:
Our a priori beliefs (prior probabilities) can be updated based on the evidence.
This process makes our decision-making more robust.
In summary, you’ve acquired tools for dealing with uncertainty through the establishment of probabilities, including:
Combined events and their probabilities
The relationship between conditional probabilities
The application of Bayes’ Theorem in practical scenarios
Spend some time working through these principles, as they are essential for understanding both theoretical and applied statistics and probability. Remember to practice using Bayes’ Theorem as it is crucial in many real-world applications.
In this lecture, we delve into the concept of independence, which is critical for understanding probability theory and building algorithms that make decisions under uncertainty.
For two events A and B, we can ask:
What is the probability that both A and B happen? (P(A ∩ B))
What is the probability that either A or B happens? (P(A ∪ B))
What is the probability of A given B has occurred? (P(A|B))
The definition of conditional probability states:
$$P(A | B) = \frac{P(A \cap B)}{P(B)}$$
Rearranging this gives us the Chain Rule:
P(A ∩ B) = P(A) ⋅ P(B|A)
This formula is fundamental in probability.
The Law of Total Probability states:
$$P(A) = \sum_i P(A \mid B_i) \, P(B_i)$$
where Bi partitions the sample space.
Bayes’ theorem relates conditional probabilities:
$$P(B | A) = \frac{P(A | B) P(B)}{P(A)}$$
Two events A and B are mutually exclusive if they cannot occur at the same time:
P(A ∩ B) = 0
If they are mutually exclusive, the probability of either event occurring is:
P(A ∪ B) = P(A) + P(B)
Two events A and B are independent if the occurrence of one does not affect the other:
P(A|B) = P(A)
This also means:
P(A ∩ B) = P(A) ⋅ P(B)
For n independent events E1, E2, …, En, the probability that all occur is:
$$P(E_1 \cap E_2 \cap \ldots \cap E_n) = P(E_1) \cdot P(E_2) \cdots P(E_n)$$
When flipping n coins with probability p of heads, the probability of exactly k heads is:
$$P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k}$$
If we flip 10 coins and each coin has a probability of 0.6 for heads, what is the probability of getting exactly 6 heads?
$$P(6 \text{ heads}) = \binom{10}{6} (0.6)^6 (0.4)^4$$
If independent routers have various probabilities of functioning, the probability that at least one router is functioning can be calculated as:
$$1 - P(\text{no router functioning}) = 1 - \prod_{i=1}^{n} (1 - p_i)$$
Understanding independence and mutual exclusivity are crucial skills for computing probabilities efficiently. The definitions and calculations introduced in this lecture lay the groundwork for solving more complex probability scenarios.
The lecture began with a warm welcome and an overview of logistics concerning problem sets and upcoming workshops. An introduction to key concepts in probability was provided, setting the stage for discussions on Bayes’ Theorem, random variables, and expectations.
Bayes’ Theorem is a fundamental concept in probability that describes how to update the probability estimate for a hypothesis as more evidence is acquired. It is expressed as:
$$P(H | E) = \frac{P(E | H) P(H)}{P(E)}$$
where:
P(H|E) is the posterior probability,
P(E|H) is the likelihood,
P(H) is the prior probability,
P(E) is the marginal likelihood.
The denominator of Bayes’ Theorem can be expressed using the Law of Total Probability:
$$P(E) = \sum_i P(E \mid H_i) \, P(H_i)$$
This allows for the calculation of probabilities over multiple possible outcomes.
In the context of the problem sets, students were tasked with programming a solution involving Bayes’ Theorem, exploring more generalized scenarios with multiple outcomes rather than binary situations.
A random variable X is a variable whose possible values are numerical outcomes of a random phenomenon. Random variables can be discrete or continuous.
A discrete random variable has an associated probability mass function P(X = x) that provides the probability that the random variable takes on the value x. The PMF must satisfy:
$$\sum_x P(X = x) = 1$$
The expectation (or mean) of a random variable X is a measure of the central tendency of the random variable, defined as:
$$E[X] = \sum_x x \, P(X = x)$$
This value may not correspond to a possible outcome of the random variable but gives a central measure.
For a fair six-sided die, the expectation is calculated as follows:
$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$$
Two events E and F are conditionally independent given a third event G if:
P(E ∩ F|G) = P(E|G)P(F|G)
This means that once G is known, information about whether E occurred provides no additional information about F (and vice versa).
It is important to note the distinction between events and random variables:
Events are the outcomes or the occurrences that can be true or false.
Random Variables are numerical quantities assigned to the outcomes of random phenomena.
The lecture concluded with a challenge involving a game based on coin flips and expected winnings, leading to discussions around the implications of expectations when probabilistic outcomes diverge significantly.
Students were encouraged to ask questions and discuss insights gained from the problem sets and assignments.
By the end of this lecture, you should be able to:
Recognize a binomial random variable and solve related probability problems.
Understand Bernoulli random variables and their applications.
Calculate the variance of a random variable.
A random variable is a numerical outcome of a probabilistic process. It can take various outcomes, each with an associated probability. The relationship between these outcomes and their probabilities is captured by the probability mass function (PMF).
P(X = x) = PMF(x)
where X is a random variable, x is a particular outcome.
The expectation (mean) of a random variable X is defined as:
$$E[X] = \sum_x x \cdot P(X = x)$$
Key properties of expectation:
Linearity of expectation: E[aX + bY] = aE[X] + bE[Y], which holds even when X and Y are dependent.
When X and Y are independent, additionally E[XY] = E[X] ⋅ E[Y]
A Binomial random variable models the number of successes in n independent Bernoulli trials, each with a success probability p.
The PMF of a binomially distributed random variable X is given by:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
where $\binom{n}{k}$ is the binomial coefficient.
Flipping a coin n times and counting the number of heads (success).
The number of voters that support a candidate given n voters and a probability p of support.
A Bernoulli random variable is a special case of the binomial distribution where n = 1.
It takes on two values:
$$X = \begin{cases}
1 & \text{with probability } p \\
0 & \text{with probability } 1-p
\end{cases}$$
E[X] = p
The variance of a random variable gives a measure of how spread out the values are. The formula for variance Var(X) is defined as:
$$\text{Var}(X) = E[(X - \mu)^2]$$
where μ = E[X].
Using the law of the unconscious statistician, an equivalent formula is:
$$\text{Var}(X) = E[X^2] - (E[X])^2$$
where
$$E[X^2] = \sum_x x^2 \cdot P(X = x)$$
In today’s lecture, we covered the fundamentals of random variables, focusing on binomial and Bernoulli distributions, and calculated their expectations and variances. These concepts form the bedrock of probability theory, allowing for better understanding of data in future sessions.
In today’s class, we will explore the Poisson distribution, a fundamental concept in probability theory that is widely applicable in fields such as epidemiology, finance, and environmental science.
A random variable is a variable that takes on values probabilistically, inheriting all the associated mathematical properties established by previous mathematicians.
We primarily distinguish between:
Discrete Random Variables: Countable outcomes with associated probabilities.
Continuous Random Variables: Outcomes in a continuous range.
We have previously discussed discrete random variables, particularly the Binomial distribution, which models the number of successes in independent trials.
Let n be the number of trials, and p the probability of success on each trial. The Binomial distribution models the number of successes, denoted as X ∼ Binomial(n, p).
The probability of exactly k successes in n trials is given by:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
where the binomial coefficient is defined as:
$$\binom{n}{k} = \frac{n!}{k!(n - k)!}$$
For a Binomial random variable:
Expectation: E[X] = np
Variance: Var(X) = np(1 − p)
The Poisson distribution models the number of occurrences of an event over a fixed interval of time or space given a known average rate λ.
A random variable Y is Poisson distributed if:
Y ∼ Poisson(λ)
where λ is the average rate of occurrences.
The Poisson probability mass function is given by:
$$P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
for k = 0, 1, 2, ….
When the number of trials n is large, and the probability of success p is small such that λ = n ⋅ p remains moderate, the Poisson distribution can be used as an approximation for the Binomial distribution.
For a Poisson random variable:
Expectation: E[Y] = λ
Variance: Var(Y) = λ
The Poisson distribution applies to various real-world scenarios, often where events occur independently over time, such as:
Incoming calls to a call center
The number of emails received in an hour
Natural phenomena like earthquakes
In our class, we analyzed data from the HURDAT database, which records hurricanes in the Atlantic. The goal was to determine the average number of hurricanes per year.
The average rate λ was calculated as follows:
$$\lambda = \frac{\text{Total Hurricanes}}{\text{Number of Years}}$$
Based on recorded data from 1851 to 1966:
Total Hurricanes: 975
Number of Years: 115
$\lambda = \frac{975}{115} \approx 8.5$
With λ = 8.5, we can model the number of hurricanes in a given year using a Poisson distribution:
$$P(Y = k) = \frac{8.5^k e^{-8.5}}{k!}$$
We can determine the probability of having more than 15 hurricanes in a given year.
$$P(Y > 15) = 1 - P(Y \leq 15) = 1 - \sum_{k=0}^{15} P(Y = k)$$
From historical data, the probability of exceeding 15 hurricanes in a year was found to be 1.3%.
Post-1966, the data indicated years with 30 hurricanes, which were improbable under the initial Poisson model. The new analysis suggested a different Poisson distribution with potentially higher rates due to advancements in hurricane detection.
Today’s exploration of the Poisson distribution showcased its utility in modeling rare events and highlighted the importance of recognizing underlying assumptions. Further insights were gathered through the analysis of hurricane data, which indicated a potential shift in hurricane frequency and intensity.
In this class, we will explore the concept of continuous random variables, building on prior knowledge of discrete random variables. Key topics will include various types of distributions, their properties, and the mathematical tools used to analyze them.
A random variable is a function that associates a numerical value with each outcome in a sample space.
They can be discrete (taking on specific values) or continuous (taking on an infinite number of values within a range).
1. Binomial Random Variable: Represents the number of successes in n independent Bernoulli trials.
Probability Mass Function (PMF):
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n$$
Expectation:
E[X] = np
Variance:
Var[X] = np(1 − p)
2. Poisson Random Variable: Models the number of events occurring in a fixed interval of time/space.
PMF:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
Expectation:
E[X] = λ
Variance:
Var[X] = λ
1. Geometric Random Variable: Models the number of trials until the first success.
PMF:
$$P(X = k) = (1 - p)^{k-1} \, p, \quad k = 1, 2, \ldots$$
Expectation:
$$E[X] = \frac{1}{p}$$
Variance:
$$Var[X] = \frac{1-p}{p^2}$$
2. Negative Binomial Random Variable: Represents the number of trials until r successes.
PMF:
$$P(X = n) = \binom{n-1}{r-1} p^r (1-p)^{n-r}, \quad n = r, r+1, r+2, \ldots$$
Expectation:
$$E[X] = \frac{r}{p}$$
Variance:
$$Var[X] = \frac{r(1-p)}{p^2}$$
1. Introduction
Continuous random variables can take on an infinite number of values within a range.
Key concept: Probability Density Function (PDF).
2. Probability Density Function (PDF): The PDF, f(x), gives the relative likelihood for the random variable to take on a given value.
Areas under the PDF represent probabilities:
$$P(a < X < b) = \int_a^b f(x) \, dx$$
Total area under the PDF equals 1:
$$\int_{-\infty}^{\infty} f(x) \, dx = 1$$
3. Uniform Distribution
All outcomes are equally likely between a and b.
PDF:
$$f(x) = \begin{cases}
\frac{1}{b-a}, & \text{if } a \leq x \leq b \\
0, & \text{otherwise}
\end{cases}$$
4. Exponential Distribution
Models the time until an event occurs (e.g., waiting time until an earthquake).
PDF:
$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$
Expectation:
$$E[X] = \frac{1}{\lambda}$$
Variance:
$$Var[X] = \frac{1}{\lambda^2}$$
1. Using the PDF to Calculate Probabilities: The probability that X exceeds a value can be derived using:
$$P(X > x) = 1 - P(X \leq x) = 1 - \int_{-\infty}^{x} f(t) \, dt$$
2. Cumulative Distribution Function (CDF): The CDF, F(x), is the probability that the random variable is less than or equal to a certain value:
$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$
3. Conversion from PDF to CDF: To compute a desired probability, leverage the CDF when possible:
P(a < X < b) = F(b) − F(a)
Understanding continuous random variables and their corresponding properties is essential in applied statistics and fields such as machine learning. This framework allows for the analysis of real-world phenomena where predictions and assessments rely on an understanding of distributions, expectations, and variances.
Recall that random variables can be classified into:
Discrete Random Variables: Take on distinct values (e.g., number of coin flips).
Continuous Random Variables: Take on any value within a range (e.g., height, weight).
For continuous random variables:
Direct probability measurements (like P(X = x)) are not applicable.
Instead, we consider the probability of occurrence within an interval, so
$$P(a < X < b) = \int_a^b f_X(x) \, dx$$ where $f_X(x)$ is the probability density function (PDF).
The Normal Distribution, also known as the Gaussian distribution, is defined by its probability density function:
$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where:
μ is the mean (location of the center).
σ2 is the variance (width of the bell curve).
The support of the normal distribution is from − ∞ to ∞.
Mean μ and variance σ2 completely characterize the normal distribution.
The expected value (mean) is:
E(X) = μ
The variance is:
Var(X) = σ2
The CDF for the normal distribution can be expressed as:
$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$
Since it has no closed form, it is often approximated or referenced through standardized values.
A special case of the normal distribution is the standard normal distribution which has μ = 0 and σ2 = 1. The PDF simplifies to:
$$f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}}$$
To convert any normal variable to the standard normal form, we apply:
$$Z = \frac{X - \mu}{\sigma}$$
Probability problems involving the normal distribution generally follow these steps:
Define the random variable.
Determine its mean and standard deviation.
Utilize the Z-score to convert to standard normal.
Reference the CDF, either through tables or computational tools.
For discrete distributions approximated by continuous distributions (e.g., Binomial approximated by Normal):
A continuity correction should be applied, usually adding or subtracting 0.5.
Many problems can be solved through computer simulation using the Monte Carlo method or direct sampling from the distributions using software tools that provide standard normal computation.
The Normal Distribution is a foundational concept in statistics and used in various fields, from natural phenomena to data science. Its mastery is crucial for understanding more advanced topics in probability and statistics.
In this session, we explored exciting concepts in probability, focusing on the transition from basic probability theory to solving real-world problems using probabilistic models. We also discussed a historical mystery related to the authorship of the Federalist Papers.
The normal distribution is a fundamental concept in probability that has applications in various real-world situations.
The normal distribution is defined by its probability density function, given by:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where:
μ is the mean,
σ2 is the variance,
e is the base of the natural logarithm.
The cumulative distribution function (CDF) is defined as:
$$F(x) = P(X \leq x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$
where Φ represents the CDF of the standard normal distribution.
To find the probability that a normally distributed random variable X is less than some value x, we use the formula:
$$P(X < x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$
where μ is the mean and σ is the standard deviation.
When dealing with two random variables X and Y, the joint probability mass function (PMF) is represented as P(X = x, Y = y). It provides complete information about the probabilities of different outcomes.
The marginal probability of X can be obtained by summing over all possible values of Y:
$$P(X = x) = \sum_y P(X = x, Y = y)$$
This technique allows us to extract information about one variable while ignoring the other.
For discrete random variables, a joint probability table can be created to summarize the probabilities of all combinations of values taken by X and Y.
The multinomial distribution generalizes the binomial distribution for cases with more than two outcomes. It describes the probabilities of counts of k different outcomes observed in n trials.
The multinomial PMF is given by:
$$P(X_1 = k_1, X_2 = k_2, \ldots, X_k = k_k) = \frac{n!}{k_1! k_2! \cdots k_k!} p_1^{k_1} p_2^{k_2} \cdots p_k^{k_k}$$
where pi is the probability of outcome i occurring, and ki is the count of those outcomes.
In investigating who authored the Federalist Papers, specifically Paper 53, we can model the problem using the multinomial distribution.
We collect the text from the Federalist Papers and text segments known to be written by Hamilton and Madison. The goal is to use word counts to derive probabilities.
We want to find the probability of an author given the document:
$$P(H | D) = \frac{P(D | H)P(H)}{P(D)}$$
This leads to the ratio:
$$\frac{P(D | H)P(H)}{P(D | M)P(M)}$$
The multinomial model allows us to compute P(D|H) and P(D|M) efficiently based on word counts observed in the text.
In summary, the session covered fundamental probability concepts including the normal distribution, marginalization in joint distributions, and the multinomial distribution. We also discussed an intriguing application involving the Federalist Papers, demonstrating the power of probabilistic models in solving historical questions.
Importance of filling out midterm attendance forms for seating arrangements.
Today’s topic: inference and belief updates using probability mass functions (PMF) and probability density functions (PDF).
A fun narrative to illustrate the importance of probability and inference.
Inference is formally about how we adjust our beliefs regarding random variables upon observing new evidence. The equations involved include:
$$\begin{aligned}
P(A | B) &= \frac{P(A, B)}{P(B)}\end{aligned}$$
where P(A|B) is the conditional probability of event A given event B.
Understanding multiple random variables interacting is vital. For instance, if we have two random variables X and Y, their joint distribution is represented as:
$$\begin{aligned}
P(X, Y) = P(X | Y) \cdot P(Y)\end{aligned}$$
Bayes’ theorem is fundamental in updating our beliefs. It can be stated as:
$$\begin{aligned}
P(H | D) = \frac{P(D | H) \cdot P(H)}{P(D)}\end{aligned}$$
where:
H: Hypothesis (e.g., author of a document)
D: Data (e.g., content of the document)
P(D|H): Likelihood of data given the hypothesis
P(H): Prior belief in the hypothesis
P(D): Normalization constant
A critical insight is understanding that for continuous random variables, the probability at a precise value is zero. Instead, we consider ranges or density functions. For a continuous random variable:
$$\begin{aligned}
P(X = x) = 0 \quad \text{for any specific } x\end{aligned}$$
However, we can define:
$$\begin{aligned}
P(a < X < b) = \int_a^b f_X(x) \, dx\end{aligned}$$
where fX(x) is the probability density function.
When working with the probability density function:
$$\begin{aligned}
P(X = x) \approx f_X(x) \cdot \epsilon\end{aligned}$$
where ϵ represents a very small number.
We utilized a hypothetical situation involving elephant weights to illustrate Bayesian inference with continuous and discrete probability:
$$\begin{aligned}
P(G = 1 | X = 163) = \frac{P(X = 163 | G = 1) \cdot P(G = 1)}{P(X = 163)}\end{aligned}$$
where G = 1 signifies a female elephant.
For the weights:
$$\begin{aligned}
X | G=1 \sim \mathcal{N}(\mu_1, \sigma_1^2) \quad \text{and} \quad X | G=0 \sim \mathcal{N}(\mu_0, \sigma_0^2)\end{aligned}$$
where you would compute P(X = 163|G = 1) using the normal PDF:
$$\begin{aligned}
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\end{aligned}$$
Understanding inference involves grasping how to update our beliefs using joint distributions and applying Bayes’ theorem effectively. Continuous variables require special consideration, especially in terms of using density functions instead of probabilities.
We will explore practical applications of this mathematics in real-world scenarios, specifically regarding the Stanford Acuity Test, an eye test that employs inference principles to update beliefs about visual acuity.
In this lecture, we explore Bayesian inference in the context of the Stanford Eye Test, where we estimate an individual’s visual acuity based on their responses to letters of varying sizes.
Bayesian inference allows us to update our beliefs about a random variable based on new evidence. The fundamental tool we use is Bayes’ Theorem:
$$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$$
Where:
P(A|B) is the posterior probability: the probability of hypothesis A given the evidence B.
P(B|A) is the likelihood: the probability of observing evidence B given A.
P(A) is the prior probability of A.
P(B) is the evidence or marginal likelihood of B.
This theorem enables us to update our probability beliefs as we gain more evidence.
The Stanford Eye Test involves presenting letters of varying sizes to a subject and recording whether their responses are correct or not. Given a letter size s and a response y (correct/incorrect), we want to update our belief about the subject’s visual acuity A:
$$P(A | Y = y, S = s) = \frac{P(Y = y | A, S = s) P(A)}{P(Y = y | S = s)}$$
The prior belief about the ability to see can be represented in various forms:
As a probability distribution over acuity levels.
As a lookup table (or dictionary) in programming, mapping acuity values to their probabilities.
The likelihood term P(Y = y|A, S = s) is critical. It describes how likely an observation (getting a specific letter correct or incorrect) is given the subject’s ability:
P(Y = y|A = a, S = s) = f(a, s)
Where f(a, s) is a function determined through empirical data or theoretical models.
The normalization constant P(Y = y|S = s) ensures that probabilities sum to one after updating:
$$P(Y = y \mid S = s) = \sum_a P(Y = y \mid A = a, S = s) \, P(A = a)$$
This is calculated across all possible acuity levels a to ensure the posterior probability is valid.
Item Response Theory (IRT) is a statistical framework used for modeling the probability of a correct response on a test item based on a person’s latent traits (ability) and the item’s characteristics (difficulty).
$$P(y = 1 | a, d) = \sigma(a - d) = \frac{1}{1 + e^{-(a - d)}}$$
Where:
y = 1 indicates a correct response.
a denotes the ability of the person.
d denotes the difficulty of the item.
σ( ⋅ ) is the logistic function that maps any real-valued number into a (0, 1) range.
The probability of an incorrect response is simply:
P(y = 0|a, d) = 1 − P(y = 1|a, d)
It’s vital to account for potential guessing errors or slips (failing to respond correctly even when the answer is known):
$$P(y = 1 \mid a, d) = (1 - p_{\text{slip}}) \, \sigma(a - d) + p_{\text{guess}}$$
Where $p_{\text{slip}}$ is the probability of making a mistake despite knowing the answer, and $p_{\text{guess}}$ is the probability of guessing correctly.
When new observations arise, we can iteratively update our beliefs. Each new observation’s corresponding posterior becomes the prior for the next iteration.
For two observations, we have:
$$\begin{aligned}
P(A | Y_1 = y_1, S_1 = s_1) & \text{ (update belief with the first observation)} \\
P(A | Y_2 = y_2, S_2 = s_2) & \text{ (update belief with the second observation using the posterior from the first)}\end{aligned}$$
This process loops through each observation, gradually refining our estimates of acuity based on cumulative evidence.
By applying Bayes’ theorem and item response theory, we can create a robust framework for estimating human visual acuity based on probabilistic models. Understanding how to update beliefs with new evidence is fundamental not only in this context but across various domains.
The core idea of probabilistic models is the concept of joint distributions. A joint distribution allows us to describe the probability of multiple random variables occurring at once:
P(X1, X2, …, Xn)
This represents the complete information about the probabilistic model involving random variables X1, X2, …, Xn.
Once we have a joint distribution, we can perform inference, which involves updating beliefs about one or more variables given the observed values of others.
An example discussed was whether one could determine the gender of an elephant from its weight. We start with the prior distribution P(Gender) and the conditional weight distributions P(Weight | Gender),
and apply Bayes’ theorem to compute the posterior distribution:
$$P(Gender | Weight) = \frac{P(Weight | Gender) \cdot P(Gender)}{P(Weight)}$$
In more complex scenarios, such as inferring a variable that can take multiple values, we may need to perform multiple iterations of Bayes’ theorem.
A Bayesian Network is a graphical model that represents a set of variables and their conditional dependencies through a directed acyclic graph (DAG).
Each node represents a random variable, while directed edges represent direct dependencies (often interpreted causally). For example, consider the following variables:
F: Flu (disease)
T: Tiredness (symptom)
Fe: Fever (symptom)
U: Undergraduate (pre-existing condition)
The structure might look like this, with Flu a parent of both Fever and Tiredness, and Undergraduate a second parent of Tiredness:
$$\text{Flu} \rightarrow \text{Fever}, \qquad \text{Flu} \rightarrow \text{Tiredness}, \qquad \text{Undergraduate} \rightarrow \text{Tiredness}$$
To define a Bayesian Network, we need to specify probabilities for each variable conditioned on its parents. For example:
P(Fe|F) and P(T|F, U)
This leads to a product of probabilities that reconstructs the joint distribution:
P(F, T, Fe, U) = P(F) ⋅ P(U) ⋅ P(Fe|F) ⋅ P(T|F, U)
In probabilistic modeling, understanding independence and the relation between random variables is paramount.
Two random variables X and Y are independent if:
P(X = x, Y = y) = P(X = x) ⋅ P(Y = y) for all values x and y
This means that knowledge about X provides no information about Y, and vice versa.
Covariance provides a measure of how much two random variables co-vary. It is defined as:
Cov(X, Y) = E[(X−E[X])(Y−E[Y])]
Alternatively, it can be computed as:
Cov(X, Y) = E[XY] − E[X]E[Y]
To discover the independence structure within data, one can calculate the covariance matrix. For instance, if variables have high covariance, they may be related.
This lecture introduced us to probabilistic modeling and Bayesian networks. Understanding joint distributions and inference allows us to compute complex relationships in data efficiently. Furthermore, covariance and independence are essential for developing robust probabilistic models.
In upcoming lectures, we will explore rejection sampling and further applications of Bayesian networks.
In today’s lecture, we cover important concepts related to probability inference, particularly focusing on an algorithm called Rejection Sampling. This technique is critically useful in probabilistic models, particularly Bayesian Networks (B-Nets).
A Bayesian Network is a graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). The notable features include:
Nodes represent random variables.
Directed edges represent conditional dependencies.
The joint probability distribution can be compactly encoded as follows:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i))$$
Inference in a Bayesian Network involves calculating conditional probabilities that define the relationships among variables. Given random variables A and B:
$$P(A | B) = \frac{P(A, B)}{P(B)}$$
Computing P(A, B) typically requires marginalization over other variables, potentially yielding computationally intense calculations if many variables exist.
Covariance quantifies how much two random variables change together. For random variables X and Y, it is defined as:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
The range of covariance is not bounded, which complicates interpretations.
Correlation, a scaled version of covariance, is bounded between -1 and 1. It is computed as:
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Correlations should be interpreted carefully: zero correlation does not imply independence, since dependent variables can still have zero correlation.
Rejection sampling is a Monte Carlo method for generating random samples from a distribution by using easy-to-sample distributions. The primary steps are:
Generate a large number of samples from the joint distribution.
Retain samples that conform to specific observed conditions.
The algorithm can be summarized as follows:
Define a Bayesian Network and its parameters.
Collect observations O.
For a defined number of samples N:
Generate a sample S from the joint distribution.
If S matches the observations O, store it.
Estimate the conditional probability based on retained samples.
If observations are rare, many samples will be rejected, leading to inadequate estimates of probability.
To sample continuous random variables, such as temperature (fever), the following process can be employed:
Generate from a normal distribution based on the parent variable’s state (e.g., flu).
Discretize the continuous variable to ensure matches during rejection sampling.
Understanding Bayesian inference and rejection sampling enables practical applications in various domains such as healthcare (e.g., WebMD) and education (e.g., student feedback). Advanced methods like Markov Chain Monte Carlo (MCMC) can further refine sampling methods by eliminating the rejection stage.
Rejection sampling serves as an efficient algorithm for inference in Bayesian Networks when conditions and relationships among variables are understood thoroughly. The applicability extends beyond academia to real-world problems where probabilistic modeling is essential.
A random variable is a variable whose values depend on the outcomes of a random phenomenon.
Representing likelihoods of outcomes as distributions rather than single values leads to a richer understanding.
Consider the like-to-dislike ratios of videos as a measure of probability.
Video 1: 10,000 likes and 50 dislikes. Video 2: 10 likes and 0 dislikes.
Initial probability assessments based only on likes lead to erroneous conclusions.
Different people give the same likelihood of rain (0.8) but with different confidence levels.
Need for a mathematical framework that allows us to evaluate the confidence in probabilities.
Bayes’ Theorem is fundamental in updating beliefs based on new evidence.
The theorem is expressed as:
$$P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)}$$
Where:
P(H|E) is the posterior probability.
P(E|H) is the likelihood.
P(H) is the prior.
P(E) is the marginal likelihood, often simplified when calculations are done over a range of H.
Redefining probabilities as random variables enhances expressive power.
Let X be the random variable representing the probability of heads when flipping a coin with unknown bias.
The Beta distribution is used to model random variables that are constrained to intervals [0, 1].
For a Beta distribution with parameters n and m, the probability density function (PDF) is given as:
$$f(x; n, m) = \frac{x^{n-1} (1-x)^{m-1}}{B(n, m)}$$
Where B(n, m) is the Beta function.
Starting from a uniform prior Beta(1, 1), after observing n heads and m tails the posterior has updated parameters:
Posterior ∼ Beta(n + 1, m + 1)
A conjugate prior allows for a prior and likelihood to be in the same family of distributions.
If the prior is a Beta distribution, the posterior remains a Beta distribution after observing data.
The expected value E[X] of a Beta distributed random variable is given by:
$$E[X] = \frac{n}{n + m}$$
The mode is the value where the PDF reaches its maximum, given by:
$$\text{Mode} = \frac{n - 1}{n + m - 2} \quad \text{(for } n > 1 \text{ and } m > 1\text{)}$$
Multi-armed bandit problem exemplifies exploration versus exploitation.
Algorithms like Thompson Sampling utilize Beta distributions to balance the need for gathering information and exploiting known best options.
Strategy: Sample from Betas of each arm and select the one with the higher sample value.
Understanding and utilizing random variables leads to improved decision-making and deeper insights into uncertain processes.
Probability should be treated as a distribution to capture uncertainty rather than a single static number.
Today, we will discuss the addition of random variables, a fundamental concept in probability theory with profound implications in various fields such as statistics, data science, and machine learning.
An engaging story was shared about the history of the lecturer’s family, transitioning into themes of randomness and unpredictability, connecting it thematically to probability and the essence of random variables.
Understanding how to add random variables is critical for solving various real-world problems, including zero-sum games and predicting behaviors in uncertain environments.
A collection of random variables is said to be IID if:
They are independent: The occurrence of one random variable does not affect the others.
They are identically distributed: All variables share the same probability distribution, implying equal expected values and variances.
Consider the following cases:
1. Xi are independent exponential random variables with the same parameter λ.
2. Xi are independent exponential random variables with different parameters λi.
3. X1 = X2 = … = Xn, so the variables are not independent.
4. Xi are independent binomial random variables with different numbers of trials but the same probability of success.
Discussion with peers about these cases illustrates the nuances of independence and identical distribution.
Let’s denote two random variables X and Y. We wish to understand the properties of the sum Z = X + Y. The probability mass function for Z, assuming X and Y are independent, can be derived as follows:
$$P(Z = n) = \sum_{i=0}^{n} P(X = i) P(Y = n - i)$$
If X and Y are not independent, you would use the joint probability distribution:
$$P(Z = n) = \sum_{i=0}^{n} P(X = i, Y = n - i)$$
1. Sum of Binomials: If X1 ∼ Binomial(n1, p) and X2 ∼ Binomial(n2, p), then X = X1 + X2 ∼ Binomial(n1 + n2, p).
2. Sum of Normals: If $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then:
$$X = X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \, \sigma_1^2 + \sigma_2^2)$$
3. Sum of Poissons: If X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2), then:
X = X1 + X2 ∼ Poisson(λ1 + λ2)
One of the most beautiful results in probability is the Central Limit Theorem, which states: If X1, X2, …, Xn are IID random variables with finite mean μ and finite variance σ2, then:
$$Z_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} N(0, 1)$$
as n → ∞.
Thus, regardless of the initial distribution of the random variables, their sum will approximate a normal distribution given a sufficiently large n.
Zero-Sum Games: Modeling and calculating win probabilities in games (such as basketball matchups using Elo ratings) can leverage the variance and distributions of player abilities.
Simulation: Practical computational methods show normal approximations among various distributions using simulations.
The lecture discussed multiple aspects of adding random variables, particularly focusing on the powerful insights provided by the Central Limit Theorem and its applications in real-world scenarios. The beauty and depth of this topic underline the importance of understanding probability and random processes.
The lecture addresses the importance of understanding intelligence in the context of decision making with a specific focus on modeling human uncertainty and making optimal decisions with limited information.
The Central Limit Theorem states that the sum (or average) of a large number of independent and identically distributed (IID) random variables will be approximately normally distributed, regardless of the original distribution of the variables.
Let X1, X2, …, Xn be IID random variables with mean μ and variance σ2. The CLT can be formulated as:
Sn = X1 + X2 + … + Xn ∼ N(nμ, nσ2)
As n → ∞, the distribution of Sn approaches a Normal distribution N(nμ, nσ2).
For instance, consider rolling a fair six-sided die. The expected value and variance for a single die roll are:
$$\mu = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5, \quad \sigma^2 = \frac{(1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2}{6} = \frac{35}{12}$$
Thus, when summing n dice rolls:
$$\text{Mean} = n \cdot \mu = 3.5n, \quad \text{Variance} = n \cdot \sigma^2 = n \cdot \frac{35}{12}$$
The sample mean X̄ calculated from a sample of n observations will have:
X̄ ∼ N(μ, σ2/n)
This shows that the sampling distribution of the mean is also normally distributed with mean μ and variance σ2/n.
It is critical to estimate population parameters from sample data. The sample mean is defined as:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
and it is an unbiased estimator of the population mean μ.
The sample variance S2 is calculated as:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
The divisor n − 1 (Bessel’s correction) is used to obtain an unbiased estimator of the population variance σ2.
When estimating parameters, we often use confidence intervals. A 95% confidence interval for the mean, using the standard error (SE) of the mean, is calculated as:
$$\bar{X} \pm Z_{\alpha/2} \cdot \text{SE} = \bar{X} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
where Zα/2 is the Z-value for the desired confidence level.
The CLT is a cornerstone of statistical theory that enables data analysts to make inferences about population parameters based on sample statistics. It is crucial to understand how to properly estimate these parameters and the limits of those estimations, especially concerning biases and sampling methods.
During today’s lecture, we will cover several important concepts in probability and statistics:
Counting Theory
Core Probability
Random Variables
Central Limit Theorem
Bootstrapping
P-Values
The Central Limit Theorem (CLT) states that if you take a sufficiently large sample size n of independent and identically distributed (IID) random variables, the sample mean will be approximately normally distributed regardless of the shape of the original population distribution. Mathematically:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
where X̄ is the sample mean, and Xi are the individual random variables.
The variance of the sample mean is given by:
$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
where σ2 is the population variance.
The sample variance s2 can be estimated using the formula:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
The n − 1 term (Bessel’s correction) is used to obtain an unbiased estimator of the population variance.
The bootstrapping method allows estimation of the distribution of a statistic (e.g., mean, variance) by resampling with replacement from the original sample. The steps are as follows:
Take n samples from the population to create a sample of size n.
Construct the empirical probability mass function (PMF) from the sample.
Resample n observations with replacement from the constructed PMF.
Calculate the desired statistic (e.g., sample variance) for each resample.
Repeat the resampling process B times to generate a distribution of the statistic.
The P-value is a measure of the strength of the evidence against the null hypothesis. It quantifies how likely it would be to observe data at least as extreme as what was actually observed, given that the null hypothesis is true.
The P-value can be calculated as follows:
Define the null hypothesis (H0) and alternative hypothesis (H1).
Calculate the observed statistic (e.g., difference in means).
Use bootstrapping to generate a distribution of the statistic under H0.
Determine the proportion of bootstrapped statistics that are more extreme than the observed statistic:
$$P\text{-value} = \frac{\text{Number of extreme statistics}}{B}$$
For example, to test whether there is a difference in means between two groups:
Combine the two groups into a single “universal” distribution under the null hypothesis.
Resample two groups with replacement from this combined distribution.
Calculate the difference in means for these resamples.
Accumulate a distribution of differences.
Compute the P-value based on how many differences are greater than or equal to the observed difference.
$$\begin{aligned}
D_{mean} &= \bar{X}_1 - \bar{X}_2 \\
P\text{-value} &= P(D_{sample} \geq D_{observed} \mid H_0 \text{ is true})\end{aligned}$$
Understanding bootstrapping and P-values provides robust tools for statistical inference, allowing researchers to express confidence in their reported statistics. Bootstrapping allows for flexible analyses without relying on strong parametric assumptions.
This lecture focuses on algorithmic analysis, particularly dealing with expectation, law of total expectation, and their implications in algorithm design and analysis. The lecture included various problems, one of which was difficult, as indicated by the instructor.
The expected value (or mean) of a random variable X can be calculated as follows:
$$E[X] = \sum_x x \cdot P(X = x)$$
where P(X = x) is the probability that the random variable X takes on the value x.
One essential property of expectation is linearity. If we have several random variables X1, X2, …, Xn, the expectation of their sum is the sum of their expectations:
$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$
This property holds regardless of whether the random variables are independent.
A limitation of the expected value is that it summarizes a probability distribution into a single number, which may overlook important characteristics of the distribution, such as variance and tail behavior.
The expected value of a function g(X) of a random variable X can be computed using the following formula:
$$E[g(X)] = \sum_x g(x) \cdot P(X = x)$$
This is useful in calculating expectations of functions of random variables.
An indicator random variable I for an event E is defined as:
$$I = \begin{cases}
1, & \text{if event } E \text{ occurs} \\
0, & \text{if event } E \text{ does not occur}
\end{cases}$$
The expected value of an indicator random variable is equal to the probability of the event:
E[I] = P(E)
A binomial random variable Y can be defined as the sum of n independent Bernoulli trials (indicator random variables):
$$Y = \sum_{i=1}^{n} I_i$$
where each Ii represents a success (1) or failure (0) in a Bernoulli trial with success probability p. The expected value of Y is:
E[Y] = n ⋅ p
The negative binomial distribution counts the number of trials until the rth success. Its expected value is given by:
$$E[Y] = r \cdot \frac{1}{p}$$
where p is the probability of success in each trial.
Consider a computer cluster with K servers, where server i independently receives a request (and is therefore busy) with probability pi. Let Ai be an indicator variable for the event that server i is idle. The total number of idle servers X can be expressed as:
$$X = \sum_{i=1}^{K} A_i$$
To find the expected number of servers not idle, Y:
Y = K − X ⇒ E[Y] = K − E[X]
Furthermore, since Ai is an indicator for the event that server i is idle, E[Ai] = P(server i is idle) = 1 − pi.
In the classic "coupon collector’s problem," let X be the number of trials until each of n buckets receives at least one coupon. The expected value of X is given by:
$$E[X] = n \cdot \sum_{k=1}^{n} \frac{1}{k} \approx n \cdot \ln(n) + \gamma n$$
Here, γ is the Euler-Mascheroni constant.
Differential privacy is a method of ensuring privacy when dealing with sensitive datasets. The fundamental approach involves adding random noise to the outputs of queries on the dataset. The expected output under differential privacy can be analyzed with the previously discussed laws of expectation.
The lecture covered a wide range of topics related to expectation and algorithmic analysis, with applications in cloud computing and data privacy. The concepts of expected values, indicator random variables, and the linearity of expectation are foundational in algorithmic design.
Machine learning is a vital part of artificial intelligence that focuses on learning from data. This lecture explores the foundational concepts of parameter estimation, particularly Maximum Likelihood Estimation (MLE) and its applications in building machine learning models.
In probabilistic models, parameters define the characteristics of the distribution:
Parameters are denoted by Θ.
For a normal distribution, parameters might include the mean (μ) and variance (σ2).
For a Poisson distribution, the parameter is denoted as λ.
Modeling: Define a probabilistic model for the real-world problem.
Training: Estimate parameters using available training data.
Prediction: Use the model with learned parameters for predictions on new data.
MLE is a method for estimating the parameters of a statistical model. It defines the likelihood of observing the given data under specific parameter values.
For independent and identically distributed (IID) data points, the likelihood function L(θ) is given by:
$$L(\theta) = \prod_{i=1}^{n} P(X_i | \theta)$$
where P(Xi|θ) is the probability of observing data point Xi given parameter θ.
Taking the logarithm simplifies calculations:
$$\log L(\theta) = \sum_{i=1}^{n} \log P(X_i | \theta)$$
To find the MLE, we maximize the log-likelihood function:
θ̂ = arg maxθlog L(θ)
This can involve solving:
$$\frac{d}{d\theta} \log L(\theta) = 0$$
The probability mass function for a Poisson distribution is:
$$P(X = k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Thus, the log-likelihood is:
$$\log L(\lambda) = \sum_{i=1}^{n} \left( X_i \log \lambda - \lambda - \log(X_i!) \right)$$
For n independent samples, the maximum likelihood estimator for λ is:
$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
The probability mass function for a Bernoulli distribution is:
$$P(X = x \mid p) = p^x (1 - p)^{1 - x}$$
The log-likelihood becomes:
$$\log L(p) = \sum_{i=1}^{n} \left( X_i \log p + (1 - X_i) \log(1 - p) \right)$$
To find MLE for p:
$$\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Asymptotic Properties: As sample size n → ∞, MLE converges to the true parameter values.
Bias of Estimators: MLE may be biased for small sample sizes; for the normal distribution the MLE of the mean is unbiased, but the MLE of the variance is biased and is commonly replaced by the Bessel-corrected estimator (dividing by n − 1).
Continuous Probability Distributions: For distributions without closed forms for the MLE, numerical optimization techniques can be used.
Identifying parameters in distributions such as Gaussian and Bernoulli.
Generalizing across distributions using algorithms like Expectation-Maximization (EM) for models that include latent variables.
Understanding Maximum Likelihood Estimation provides the mathematical background necessary for various machine learning techniques and helps in building more sophisticated models.
In this class, we are transitioning from pure probability theory to the fundamentals of machine learning. Our objective is to understand parameter estimation as the core challenge of machine learning, and how this relates to deep learning. The main focus will be on estimating parameters from probabilistic models based on given data.
Parameter estimation is the process of inferring the values of parameters (Θ) in a probabilistic model based on observed data.
Goals:
Model real-world problems using probabilistic models.
Estimate parameters using historical data.
To denote the parameters, we will use the symbol Θ (theta).
The first significant method for estimating parameters is Maximum Likelihood Estimation (MLE). The idea is to choose parameters that maximize the likelihood of observing the given data.
Given a dataset {x1, x2, …, xn}, the likelihood L(Θ) can be written as:
$$L(\Theta) = P(X|\Theta) = \prod_{i=1}^{n} P(x_i|\Theta)$$
Where P(xi|Θ) is the probability of observing each data point given the parameters Θ.
To simplify calculations, we often work with the log-likelihood, denoted as:
$$\log L(\Theta) = \sum_{i=1}^{n} \log P(x_i|\Theta)$$
Maximizing the log-likelihood is computationally easier than maximizing the likelihood function directly.
To find the maximum, we take the derivative of the log-likelihood with respect to Θ:
$$\frac{d}{d\Theta} \log L(\Theta) = 0$$
The critical points can be found by setting the above derivative to zero.
MLE can lead to overfitting, where the estimated parameters define the model too closely based on the training data, potentially failing on new data.
When maximizing functions, solving for the point where the derivative equals zero is not always possible or efficient in closed form. Gradient ascent provides a method to incrementally reach the maximum:
Start with an initial guess Θ0.
Compute the gradient g of the log-likelihood at that point.
Update the parameters:
$$\Theta_{i+1} = \Theta_i + \alpha g$$
Where α is the step size.
The gradient ascent method climbs toward a maximum of the log-likelihood function. However, if the steps are too large, it might overshoot the maximum.
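A minimal sketch of gradient ascent on a log-likelihood, using the Bernoulli model (where the closed-form answer, the sample mean, lets us check convergence); the initial guess, step size, and iteration count are arbitrary choices:

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical Bernoulli data
n, k = len(flips), flips.sum()

p = 0.5        # initial guess Theta_0
alpha = 0.01   # step size

for _ in range(2000):
    # Gradient of log L(p) = k log p + (n - k) log(1 - p)
    grad = k / p - (n - k) / (1 - p)
    p = p + alpha * grad                # Theta_{i+1} = Theta_i + alpha * g
    p = np.clip(p, 1e-6, 1 - 1e-6)      # keep p inside (0, 1)

print(p, flips.mean())  # both approximately 0.625
```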
Unlike MLE, Bayesian estimation incorporates prior beliefs about the parameters through Maximum A Posteriori (MAP) estimation.
To estimate the parameters using Bayes’ theorem, we focus on:
$$P(\Theta|X) = \frac{P(X|\Theta)P(\Theta)}{P(X)}$$
Where: 1. P(Θ|X) is the posterior. 2. P(X|Θ) is the likelihood. 3. P(Θ) is the prior belief.
To estimate parameters using MAP:
$$\Theta_{\text{MAP}} = \arg\max_{\Theta} P(\Theta \mid X) = \arg\max_{\Theta} P(X \mid \Theta) P(\Theta)$$
This is akin to performing MLE while also considering the prior:
$$\log P(\Theta \mid X) = \log P(X \mid \Theta) + \log P(\Theta) - \log P(X)$$
where the last term does not depend on Θ and can be ignored during maximization.
MAP estimation helps mitigate the overfitting issues associated with MLE by incorporating prior distributions for Θ.
In situations where the prior distribution and the likelihood function are of the same family, we say the prior is a conjugate prior. Some common conjugate distributions include:
Beta distributions for Bernoulli/binomial models.
Gamma distributions for Poisson models.
Normal distributions for Gaussian models.
Given Bernoulli trial data consisting of successes and failures, if we want to estimate the probability p, we would use:
P(p|data) ∝ P(data|p)P(p)
Where P(p) can be a beta prior, resulting in a posterior that is also a beta distribution.
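Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior has a closed form and its mode is the MAP estimate. A minimal sketch (the data and prior counts are illustrative):

```python
heads, tails = 3, 1   # hypothetical observed flips
a, b = 2, 2           # Beta(2, 2) prior: a weak belief that p is near 0.5

p_mle = heads / (heads + tails)                        # 0.75
# Posterior is Beta(heads + a, tails + b); its mode gives the MAP estimate
p_map = (heads + a - 1) / (heads + tails + a + b - 2)  # 4/6 ≈ 0.667

print(p_mle, p_map)   # MAP is pulled toward the prior mean of 0.5
```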
To compute MAP estimates:
Specify a prior.
Collect data and calculate the likelihood.
Use the log-likelihood expression plus the prior’s log to optimize using gradient ascent or other optimization techniques.
The transition from MLE to MAP and the introduction of priors significantly enhances our ability to make robust parameter estimates that generalize better to new data. This will be foundational as we explore more complex machine learning algorithms in the upcoming classes.
Naive Bayes is the first machine learning algorithm explored in the course. It is particularly useful for classification tasks.
Machine learning can be visualized as employing probabilistic models with learned parameters to make predictions based on input data.
The two main approaches for parameter estimation are:
Maximum Likelihood Estimation (MLE): Chooses parameters making the observed data most probable.
Maximum A Posteriori (MAP): Similar to MLE, but also accounts for prior beliefs regarding parameters.
Inputs and outputs in classification tasks are denoted using specific notation:
Let Xi represent the i-th input feature.
Let Y be the output variable (the label we wish to predict).
Superscripts denote specific instances (e.g., Y^(1), Y^(2) for different users).
Bold variables (e.g., X) indicate vectors containing input values.
The naive Bayes classifier is based on the strong ("naive") assumption that the inputs are conditionally independent given the output class:
$$P(\mathbf{X} \mid Y) = P(X_1 \mid Y) \cdot P(X_2 \mid Y) \cdots P(X_m \mid Y)$$
Where X = (X1, X2, …, Xm) are input features.
The prediction formula for the naive Bayes classifier can be expressed as:
P(Y|X) ∝ P(Y) ⋅ P(X|Y)
After applying the naive Bayes assumption, we focus on the numerator:
$$P(Y | X) \propto P(Y) \cdot \prod_{i=1}^{m}P(X_i | Y)$$
Parameter estimation for naive Bayes involves:
Estimating P(Y = 1) and P(Y = 0) using counting from training data.
For each feature Xi, estimating P(Xi|Y) using MLE or MAP. For MAP with Laplace smoothing:
$$P(X_i = 1 | Y) = \frac{\text{count}(X_i = 1, Y) + 1}{\text{count}(Y) + 2}$$
Where count(Xi = 1, Y) denotes the number of training instances in class Y for which Xi = 1, and count(Y) is the number of training instances in class Y.
To make a prediction for a new data point:
Compute P(Y = 1|X) and P(Y = 0|X) using the derived parameters.
Choose the class with the higher probability:
$$\hat{Y} = \arg\max_{Y} P(Y \mid \mathbf{X})$$
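A minimal sketch of training and prediction for a naive Bayes classifier with binary features, using counting for the prior and Laplace smoothing for the conditionals; the toy dataset is made up:

```python
import numpy as np

# Toy training data: rows = examples, columns = binary features
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

p_y1 = y.mean()  # P(Y = 1) estimated by counting

# P(X_i = 1 | Y) with Laplace smoothing: (count(X_i = 1, Y) + 1) / (count(Y) + 2)
p_x_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
p_x_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)

def predict(x):
    # Work in log space to avoid underflow with many features
    log1 = np.log(p_y1) + np.sum(x * np.log(p_x_y1) + (1 - x) * np.log(1 - p_x_y1))
    log0 = np.log(1 - p_y1) + np.sum(x * np.log(p_x_y0) + (1 - x) * np.log(1 - p_x_y0))
    return int(log1 > log0)

print(predict(np.array([1, 0, 0])))  # choose the class with the higher posterior
```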
A notable application of naive Bayes is in email spam detection. In this scenario:
Features are individual words (or their counts) in the email.
Y is a binary label indicating whether an email is spam or not.
Parameters of the model are estimated from a labeled dataset of emails.
To evaluate the classifier’s performance, standard metrics include:
Accuracy: Overall correctness $\frac{\text{correct predictions}}{\text{total predictions}}$.
Precision: Ratio of true positive predictions to the total positive predictions made.
Recall: Ratio of true positive predictions to the actual positives in the data.
Naive Bayes is a foundational algorithm for classification that simplifies complex probability calculations through its naive independence assumption. It forms the basis for understanding more complex machine learning methodologies and remains relevant due to its efficiency and effectiveness, particularly in situations with high-dimensional data, such as text classification.
This course covers fundamental concepts and algorithms in Machine Learning.
Key topics include classification algorithms, parameter estimation, and logistic regression, leading to applications in deep learning.
Classification: The task of predicting discrete class labels (e.g., 1 or 0).
Two primary algorithms:
Naive Bayes: Assumes independence among features given the class label.
Logistic Regression: A flexible model used to estimate probabilities of class membership.
Build a model that maps input features x to an output class label y.
Use training data (x, y) to learn model parameters.
Evaluate the model’s performance using a reserved test dataset.
MLE is used to estimate the parameters that maximize the likelihood of the observed data.
For a probabilistic model defined by a training dataset, the goal is to maximize the likelihood:
L(θ) = P(Y|X; θ)
where θ are the model parameters.
Logistic regression predicts the probability that y = 1 given input features x.
The logistic regression model is defined as follows:
$$P(y=1 | \mathbf{x}; \theta) = \sigma(\theta^T \mathbf{x}) = \frac{1}{1 + e^{-\theta^T \mathbf{x}}}$$
where σ(z) is the sigmoid function transforming the weighted sum into a probability.
The sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
Output is always between 0 and 1; useful for interpreting outputs as probabilities.
Derivative:
σ′(z) = σ(z)(1 − σ(z))
We introduce an intercept term by including x0 = 1, hence:
$$z = \theta^T \mathbf{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_m x_m$$
The model predicts:
P(y = 1|x) = σ(z)
The log-likelihood function is computed for the training data:
$$\ell(\theta) = \sum_{i=1}^n [y_i \log(\sigma(z_i)) + (1-y_i) \log(1 - \sigma(z_i))]$$
where n is the number of training instances.
To optimize the parameters, compute the gradient of the log-likelihood:
$$\frac{\partial \ell}{\partial \theta_j} = \sum_{i=1}^n (y_i - \sigma(z_i)) x_{ij}$$
Update the parameters iteratively:
$$\theta_j \leftarrow \theta_j + \alpha \frac{\partial \ell}{\partial \theta_j}$$
where α is the learning rate.
After training, to make a prediction for an input x:
Compute z = θTx
Calculate the predicted probability: ŷ = σ(z)
Convert to binary prediction:
$$\widehat{y} =
\begin{cases}
1 & \text{if } \hat{y} > 0.5 \\
0 & \text{otherwise}
\end{cases}$$
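A minimal sketch of the full training-and-prediction loop for logistic regression using the gradient ascent update above; the toy data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: two features, binary labels
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [2.5, 2.5]])
y = np.array([0, 0, 1, 1])
X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 for the intercept

theta = np.zeros(X.shape[1])
alpha = 0.1

for _ in range(1000):
    z = X @ theta
    grad = X.T @ (y - sigmoid(z))   # sum over i of (y_i - sigma(z_i)) x_ij
    theta += alpha * grad           # gradient ascent step

x_new = np.array([1.0, 2.0, 1.0])  # new point, with the leading 1 included
y_hat = sigmoid(theta @ x_new)
print(y_hat, int(y_hat > 0.5))
```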
Logistic regression serves as a foundation for understanding more complex models in deep learning.
Deep learning represents a subset of machine learning focused on algorithms that learn from data in a hierarchical manner, inspired by the structure of the human brain.
Understand the core concepts of deep learning.
Familiarize with the algorithms behind many modern applications.
Recognize how deep learning can be used for various tasks such as image recognition and generative models.
Logistic regression is a fundamental algorithm in supervised classification where:
Inputs are weighted and passed through a sigmoid function to yield a probability.
The equation for logistic regression can be expressed as:
$$\hat{y} = \sigma(\theta^T \mathbf{x})$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
A neural network can be visualized as multiple logistic regression layers stacked together.
Each neuron performs its own logistic regression, with distinct parameters (weights).
A basic feedforward network structure comprises:
Input layer (features)
Hidden layers (neurons applying transformations)
Output layer (final outcomes)
Each hidden layer neuron hj can be expressed mathematically as:
$$h_j = \sigma\left(\sum_{i=0}^{m} \theta_{ji} x_i\right)$$
The forward pass through a neural network involves:
Passing input data through the network to generate predictions.
Each layer calculates its outputs based on the previous layer’s outputs.
$$\hat{y} = \sigma(W \cdot \mathbf{h} + b)$$
To train a neural network, we need to optimize the weights through a process known as backpropagation, which relies on:
Gradient ascent for maximizing the likelihood function.
Using the chain rule to compute gradients of weights with respect to the loss function.
The loss function for binary outcomes can be represented as:
$$L(y, \hat{y}) = -\left( y \log(\hat{y}) + (1-y) \log(1-\hat{y}) \right)$$
In backpropagation:
The chain rule allows for connecting derivatives for nested functions:
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h_j} \cdot \frac{\partial h_j}{\partial \theta}$$
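A minimal sketch of a forward pass and one backpropagation update for a tiny one-hidden-layer network with sigmoid activations (bias terms omitted for brevity); the sizes, data, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 1.0])      # one input example (3 features)
y = 1.0                             # its label

Theta = rng.normal(size=(4, 3))    # hidden layer: 4 neurons, each its own logistic regression
W = rng.normal(size=4)             # output layer weights
alpha = 0.1

# Forward pass
h = sigmoid(Theta @ x)             # hidden activations h_j
y_hat = sigmoid(W @ h)             # prediction

# Backward pass via the chain rule; for the cross-entropy loss above,
# the derivative with respect to the output pre-activation is (y_hat - y)
delta_out = y_hat - y
grad_W = delta_out * h
delta_h = delta_out * W * h * (1 - h)   # uses sigma'(z) = sigma(z)(1 - sigma(z))
grad_Theta = np.outer(delta_h, x)

# Gradient descent on the loss (equivalently, ascent on the log-likelihood)
W -= alpha * grad_W
Theta -= alpha * grad_Theta
print(y_hat)
```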
Image Classification (e.g., digit recognition)
Natural Language Processing (NLP)
Generative Models (e.g., GANs for image generation)
Self-Driving Cars
Health Diagnostics (e.g., cancer detection through image analysis)
Deep learning enables us to create powerful predictive models by stacking layers of neurons (logistic regressions) and training them using large datasets. It is becoming ubiquitous in applications ranging from healthcare to art creation. Understanding the foundations, such as logistic regression and how neural networks operate, is key to advancing in this field.
To delve deeper into deep learning:
Explore hyperparameters and their significance.
Investigate optimization techniques beyond basic gradient ascent, like Adam or RMSprop.
Study architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
Today, we are diving into the important topic of ethics in machine learning and the probability concepts behind fairness. Understanding machine learning is crucial for addressing the impacts and implications of smarter algorithms on society.
The philosophical stance this lecture gravitates towards is the Mencian philosophy, which posits that goodness exists inherently within all of us, akin to a sprout that can grow with care.
The allegory of the baby falling into a well illustrates an instinctual response to help, suggesting a hardwired goodness in human nature.
AI influences multiple aspects of our daily lives, such as:
Smartphones
Security cameras
Social media
The dual implications of AI:
It holds great potential for societal benefits (e.g., better healthcare, intelligent electricity grids, access to quality education).
It can also manifest in wayward applications that exhibit biases, calling for ethical considerations.
By the end of this lecture, you should:
Understand the limitations of “fairness through unawareness.”
Recognize measurement techniques for fairness, specifically two measurement frameworks:
Calibrated models
Relaxation of parity
Familiarize yourself with techniques to mitigate fairness issues.
1. Protected Demographic:
- Groups that are safeguarded against discrimination based on characteristics such as race, age, gender, etc.
2. Distributive vs. Quality of Service Harm:
- Quality of Service Harm: AI performs better for certain demographic groups than others (e.g., inaccurate facial recognition).
- Distributive Harm: AI impacts critical opportunities (job applications, loans) unfairly based on demographics, often resulting in more severe consequences.
3. Procedural vs. Distributive Fairness:
- Procedural fairness focuses on the fairness of the process.
- Distributive fairness emphasizes fairness in the outcomes.
Classification algorithms covered so far include:
- Naive Bayes
- Logistic Regression
Logistic regression predicts outcomes using a sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
where we predict 1 if z > 0 and 0 if z < 0.
Logistic regression produces a straight-line (hyperplane) decision boundary between classes, which limits its effectiveness on data that is not linearly separable.
Fairness through Unawareness: Excluding demographic features to make decisions.
Example: Not considering race when making decisions.
Fairness through Awareness: Taking demographics into account for decisions but ensuring equality in outcomes.
Parity: The probability of a positive prediction should be the same across demographics:
$$P(\hat{Y} = 1 \mid D = d_0) = P(\hat{Y} = 1 \mid D = d_1)$$
Calibration: The prediction accuracy should be equal across demographics:
P(Ŷ = T|D = d) = P(Ŷ = T)
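A minimal sketch of checking parity on a set of predictions; the arrays below are hypothetical:

```python
import numpy as np

y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # model predictions
d     = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # demographic group for each person

# Parity: P(Y_hat = 1 | D = d) should be (approximately) equal across groups
rate_d0 = y_hat[d == 0].mean()
rate_d1 = y_hat[d == 1].mean()
print(rate_d0, rate_d1)   # 0.75 vs 0.25 -> parity is violated in this toy example
```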
Quality of Service Harm example: cameras whose autofocus performs poorly for specific demographic groups.
Caused by training data biases.
Distributive Harm Case Study: An algorithm used for college admissions that inadvertently favored demographic characteristics, hindering female and non-Caucasian applicants.
Improve training data balance to ensure representation.
Implement model cards for transparency in AI model data and applications.
Explore methods that mitigate bias, for example by training auxiliary neural networks to predict demographic information from the model’s outputs so that such information can be detected and removed.
The story of free basic internet by Facebook illustrates unforeseen consequences, such as misinformation leading to severe societal harm.
Awareness of unintended biases in algorithms can provoke harm, even if the original intent was beneficial.
The ethical framework around machine learning necessitates recognizing blind spots and being adaptable.
Fostering an environment of questioning and critical thinking when deploying algorithms that impact lives is essential.
Please take time to reflect on the ethical implications of your work and the constructs of fairness in machine learning. Your awareness and consideration can shape the future of technology harmoniously.
Generative models learn to generate data similar to the training set, allowing for applications such as text synthesis and image generation.
DALL-E: Generates images from textual descriptions.
GPT-3: Generates text based on given prompts.
Maximum Likelihood Estimation (MLE): A method used to estimate the parameters of a statistical model.
Maximum A Posteriori (MAP) Estimation: Similar to MLE but incorporates prior beliefs about the parameters.
DALL-E uses a neural network to create images from textual descriptions, leveraging the concept of denoising.
The key idea is to start with a heavily noised image and progressively remove noise through a neural network.
To manipulate images, Gaussian noise is added:
Noisy_Image = Original_Image + 𝒩(0, σ²)
where 𝒩(0, σ²) is Gaussian noise with mean 0 and variance σ².
The neural network is trained to minimize the difference between the predicted noise and the actual noise, leading to the minimization of the sum of squared errors:
$$\text{Loss}(\theta) = \sum_{i} (y_i - \hat{y}_i)^2$$
where yi is the true noise and ŷi is the predicted noise.
The process is repeated several times to generate an image that resembles the target:
Start with completely noisy image.
Apply the trained denoising network multiple times.
Each iteration reduces the noise.
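A heavily simplified sketch of these two ideas (not the actual DALL-E architecture or training procedure): the loss compares predicted noise to the noise actually added, and generation repeatedly subtracts predicted noise starting from pure noise. The predict_noise function is a placeholder standing in for a trained network:

```python
import numpy as np

def squared_error_loss(predicted_noise, actual_noise):
    # Loss(theta) = sum_i (y_i - y_hat_i)^2, where y is the true noise
    return np.sum((actual_noise - predicted_noise) ** 2)

predict_noise = lambda img: np.zeros_like(img)   # placeholder for a trained network

# Training step: noise an image, then score the network's noise prediction
original = np.random.rand(64, 64)                        # placeholder training image
actual_noise = np.random.normal(0, 0.5, original.shape)  # N(0, sigma^2)
noisy = original + actual_noise
print(squared_error_loss(predict_noise(noisy), actual_noise))

# Generation: start from pure noise and repeatedly remove predicted noise
image = np.random.normal(0, 1, (64, 64))
for _ in range(50):
    image = image - predict_noise(image)
```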
GPT-3 uses a different but related approach to generate coherent and contextually relevant text.
The attention mechanism allows the model to weigh the importance of different words in the context:
$$\text{Attention}(Q, K, V) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where Q (queries), K (keys), and V (values) are matrices derived from input words, and dk is the dimensionality of the keys.
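A minimal numpy sketch of the scaled dot-product attention formula above, with random placeholder matrices rather than learned projections:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)   # (4, 8)
```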
The text generation process can be broken down into:
1. Modeling the probability of the next word given previous words:
$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$
2. Selecting the next word based on the probability distribution obtained:
Sample from P(w_t | w_1, …, w_{t−1})
Sampling from the predicted probability distribution (rather than always picking the most likely word) helps maintain variability in the output.
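A minimal sketch of sampling a next word from a predicted distribution (the vocabulary and probabilities are made up):

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]
probs = np.array([0.5, 0.2, 0.2, 0.1])        # hypothetical P(w_t | previous words)

next_word = np.random.choice(vocab, p=probs)   # sampling keeps the output varied
print(next_word)
```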
Like DALL-E, GPT-3 is trained using MLE, focusing on maximizing the likelihood of observing the actual next word in the training set relative to the predicted word.
In today’s lecture, we explored the foundational ideas behind two significant generative models: DALL-E and GPT-3.
Generative models are shaping the future of AI with applications spanning multiple domains. Our understanding of these models will play a pivotal role in developing future technologies.
Ensure you have access to practice final exams and resources. During the course, we covered several topics:
Counting and Probability Fundamentals
Single Random Variables
Probabilistic Models
Uncertainty Theory
Artificial Intelligence Applications
Reflection on the potential in research:
There is an abundance of problems to tackle.
Every student has the potential to contribute meaningfully to research.
A vignette was shared about using similarity distance to select representative assignments from a pool of 300 solutions. The goal is to minimize the distance between selected and unselected solutions.
Thompson Sampling is a probabilistic method that can reduce the complexity of this selection problem from O(n²) to O(n log n) by evaluating uncertainty through sampling.
P(A|B) ∝ P(B|A)P(A)
The algorithm allows for effective exploration of best candidates by using belief distributions over possible outcomes.
A major theme discussed was the gap in quality education globally. The drive towards improving educational quality can leverage the increasing access to technology (e.g., smartphones) and rich datasets from platforms like Code.org.
A focus was given to the difficulties in providing feedback, especially in coding assignments:
Traditional feedback processes are time-consuming and often inadequate.
A solution was proposed by utilizing probabilistic models to understand students’ latent needs based on their previous work.
The idea of collecting a vast amount of labeled data to inform feedback algorithms can significantly impact students’ learning trajectories. Methods explored include:
Correlation of coding errors with future decisions.
The development of rubrics to quantify feedback.
Bayesian Models allow for:
Modeling students’ progress through probabilistic inference.
Using previous scores and performance, Bayesian networks can help predict and guide future learning paths.
The course introduced complex graphical models for better representation and inference in relation to random variables.
Given two random variables X and Y:
P(X, Y) = P(X|Y)P(Y)
Exploring the relationships between variables can lead to insights in clustering student submissions, reducing redundancy in sample evaluations, and enabling more personalized feedback.
Consider the following classes that follow CS 109:
Decision-Making Under Uncertainty
CS 221 (Artificial Intelligence)
CS 229 (Machine Learning)
CS 228 (Probabilistic Graphical Models)
Encouragement to find unique problem intersections based on personal lived experiences, interests, and academic backgrounds.
The lecture concluded with gratitude expressed towards the students for their engagement throughout the course. It emphasized the importance of continuous exploration in both academic and personal paths.
“You live in interesting times — a time full of potential discoveries and opportunities.”
Familiarize yourself with common phrases such as:
Probability of X: Calculate P(X).
How much more likely is X than Y?: Use the ratio of likelihoods, often expressed as:
$$\frac{P(X)}{P(Y)}$$
Is it significant?: Typically involves calculating a p-value or utilizing bootstrap methods.
The Beta distribution has two parameters, A and B, which can be interpreted (roughly) as one plus the counts of heads and tails, respectively, observed in repeated Bernoulli trials.
It has a PDF given by:
$$f(x; A, B) = \frac{x^{A-1} (1-x)^{B-1}}{B(A, B)}, \quad 0 < x < 1$$
The expectation and variance are critical:
$$\text{E}(X) = \frac{A}{A+B}, \quad \text{Var}(X) = \frac{AB}{(A+B)^2(A+B+1)}$$
For independent identically distributed (IID) random variables, the following rules apply:
Poisson: The sum of independent Poisson variables is Poisson with parameter λ = ∑ᵢ λᵢ.
Normal: Sum the means and variances:
$$X = X_1 + X_2 \implies \mu_X = \mu_1 + \mu_2, \quad \sigma_X^2 = \sigma_1^2 + \sigma_2^2$$
Binomial: The sum of independent Binomial variables with the same success probability p is Binomial with parameters n = n1 + n2 and p.
The sum (or average) of a large number of IID random variables will tend to be normally distributed:
$$\text{If } X_1, X_2, \ldots, X_n \text{ are IID, then }
\sqrt{n} \left( \bar{X}_n - \mu \right) \overset{d}{\to} \mathcal{N}(0, \sigma^2) \text{ as } n \to \infty.$$
Apply a continuity correction when using the CLT to approximate a discrete distribution.
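A minimal simulation sketch of the CLT: sample means of IID exponential draws concentrate around the true mean with variance close to σ²/n; the parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 10_000

# Each row holds n IID Exponential(1) samples (mean 1, variance 1); take each row's mean
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print(means.mean(), means.var(), 1 / n)   # mean ~ 1, variance ~ 1/n
```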
The sample mean is computed as:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
The sample variance (unbiased) is:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
MLE involves two main steps:
Formulate the likelihood function, typically as:
$$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
Maximize the likelihood (or log-likelihood):
$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)$$
The MAP estimation improves MLE by incorporating a prior:
P(θ|X) ∝ P(X|θ)P(θ)
Resampling method used to estimate the distribution of a statistic (e.g., mean, variance).
Typical structure (a minimal Python sketch, assuming data is an array of observed samples and the statistic of interest is the mean):
import numpy as np
bootstrap_stats = []
for i in range(1000):
    resample = np.random.choice(data, size=len(data), replace=True)  # resample with replacement
    bootstrap_stats.append(resample.mean())  # compute the statistic
Naive Bayes applies Bayes’ theorem together with the assumption that features are conditionally independent given Y:
$$P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}$$
Implement Laplace smoothing to avoid zero probabilities:
$$P(X_i | Y) = \frac{N_{X_i, Y} + 1}{N_Y + |V|}$$
Logistic regression models Y given the inputs X as a Bernoulli random variable whose parameter is:
$$P(Y = 1 \mid \mathbf{X}) = \sigma(W^T \mathbf{X})$$
where σ is the logistic function.
For problems involving multiple parameters, derive partial derivatives for each parameter.
Expect questions about applying MLE and MAP in practical scenarios, including parameter estimation and hypothesis testing.
Focus on model applications, derive equations where necessary, and practice interpreting results from sampling methods and estimations.