# Notes: CS 188 / Artificial Intelligence


These notes are for CS 188: Introduction to Artificial Intelligence taught by Mesut Yang and Carl Qi in Summer 2021.

# Midterm

## Searches

Planning agents search future states by simulating the world. Reflex agents choose an action based on the world's current state. A **search problem** consists of a state space, a successor function (with actions and costs), and a start state/goal test(s). A **solution** transforms the start state to the goal state. A search state is **intentionally** smaller than the world state.

A **state space graph** includes nodes as world configurations with arcs representing successors; it's usually *too big* to build in memory. A **search tree** represents plans and their outcomes with the start state as the root node; we usually can't build the whole tree. The **fringe** of a search tree denotes the partial plans under consideration.

Generally, a tree search maintains a fringe and uses a **strategy** to select a node for expansion. A search algorithm is **complete** if it is guaranteed to find a solution and **optimal** if it finds the least cost path. For a given search tree, we notate \(b\) as the branching factor and \(m\) as the maximum depth. Types of searches:

**Depth first search**- expands deepest node first, fringe is LIFO
- complete if there aren't cycles, but not optimal
- time complexity \(O(b^m)\) and space complexity \(O(bm)\)

**Breadth first search**- expands shallowest node first, fringe is FIFO
- complete; optimal only if all step costs are equal
- time complexity \(O(b^s)\) and space complexity \(O(b^s)\), where \(s\) is the depth of the shallowest solution

**Iterative deepening**- repeatedly use DFS with incrementing depth limit
- "best of both worlds"

**Uniform cost search**- expand cheapest node first (backward cost), fringe is priority queue
- complete and optimal with finite costs and positive arc costs
- time and space complexity \(O(b^{C^*/\epsilon})\), where the solution costs \(C^*\) and arcs cost at least \(\epsilon\)

**Greedy search**- estimates distance to solution (forward cost) and expands the node that seems closest
- complete but not optimal
- worst case is like a bad DFS

**A* search**- expands the node with the lowest total backward plus forward cost
- requires an **admissible** heuristic \(h\) with \(0 \leq h(n) \leq h^*(n)\), where \(h^*\) is the true cost; admissible heuristics are often derived from relaxed problems
- there's a tradeoff between the quality of the heuristic's estimate and the work per node
- a heuristic \(h_a\) dominates \(h_c\) if \(\forall n, h_a(n) \geq h_c(n)\); the trivial heuristic always gives 0
- complete and optimal (with an admissible heuristic), and faster than UCS since it expands "toward" the goal

A tree search that fails to detect repeated states can do exponentially more work. **Graph search** keeps a set of expanded states (the closed set) and skips nodes that have already been expanded. For A* graph search, the heuristic must be **consistent**: \(h(A) - h(C) \leq \text{cost}(A \text{ to } C)\) for every arc from \(A\) to \(C\). Consequently, the \(f\)-value (backward plus forward cost) along a path never decreases.
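As a concrete sketch (not from the course code), here is a minimal A* graph search in Python. The `successors(state)` and `heuristic(state)` callbacks are hypothetical placeholders for a specific problem; the counter only breaks ties so the heap never compares states directly.

```python
import heapq
import itertools

def a_star_graph_search(start, is_goal, successors, heuristic):
    """A* graph search: expand by f(n) = g(n) + h(n), skipping states in the closed set."""
    counter = itertools.count()                      # tie-breaker for the heap
    fringe = [(heuristic(start), next(counter), 0, start, [start])]
    closed = set()
    while fringe:
        f, _, g, state, path = heapq.heappop(fringe)
        if is_goal(state):
            return path, g
        if state in closed:                          # already expanded via a cheaper path
            continue
        closed.add(state)
        for next_state, step_cost in successors(state):
            if next_state not in closed:
                new_g = g + step_cost
                heapq.heappush(fringe, (new_g + heuristic(next_state), next(counter),
                                        new_g, next_state, path + [next_state]))
    return None, float("inf")
```

With a consistent heuristic, the first time a state is popped its \(g\)-value is already optimal, which is why the closed-set check is safe.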

## Game Trees

**Adversarial games** are deterministic games with two or more players that are zero-sum and have perfect information. A **strategy** recommends a move from the current state. An example formulation could include states \(S\), players \(P\), actions \(A\), a transition function \(S \times A \to S\), a terminal test \(S \to \{t, f\}\), and terminal utilities \(S \times P \to \mathbb{R}\). A solution is a **policy**. Whereas a **general game** involves independent utilities for agents, a zero-sum game can be represented with a single value.

The **value** of a state is the best achievable utility from the state. In **minimax**, players alternate turns and each player computes the best utility against an optimal adversary. Minimax *may not be optimal* against a non-optimal adversary that makes mistakes. Just like DFS, it has time complexity \(O(b^m)\) and space complexity \(O(bm)\).

**Alpha-beta pruning** prunes nodes that would never be picked. Suppose we are MAX and we are computing the MIN-VALUE at some node \(n\). Let \(\alpha\) be the best value that MAX can get at any choice point along the current path to the root. If \(n\)'s value becomes worse than \(\alpha\), then MAX will avoid it and we can stop considering \(n\)'s other children. A symmetric argument holds for MAX nodes and \(\beta\).
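As a sketch (not the lecture code), here is depth-limited minimax with alpha-beta pruning. The `successors`, `evaluate`, and `is_terminal` callbacks are hypothetical problem-specific hooks.

```python
def alpha_beta(state, depth, alpha, beta, maximizing, successors, evaluate, is_terminal):
    """Minimax with alpha-beta pruning; prunes branches that cannot affect the root decision."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in successors(state):
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False,
                                          successors, evaluate, is_terminal))
            if value >= beta:      # the MIN node above would never allow this branch
                return value
            alpha = max(alpha, value)
        return value
    else:
        value = float("inf")
        for child in successors(state):
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True,
                                          successors, evaluate, is_terminal))
            if value <= alpha:     # the MAX node above already has something better
                return value
            beta = min(beta, value)
        return value
```

A MAX node prunes as soon as its value reaches \(\beta\), since the MIN node above would never let play reach it; the MIN case mirrors this with \(\alpha\).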

**Depth-limited search** computes terminal utilities with an **evaluation function**, an approximation of the true minimax value that is usually a weighted linear sum of features. It's much more realistic, but there's no guarantee of optimal play. With insufficient depth and a bad evaluation function, we could have re-planning agents that thrash around.

Actions could have uncertain outcomes because of unseen variables, unpredictable opponents, and failure to act. **Expectimax search** computes the average score under optimal play with chance nodes instead of min nodes. Having a probabilistic belief about another agent does *not* necessarily mean that the other agent is actually acting randomly.

## Reinforcement Learning

A **Markov Decision Process** (MDP) is defined by a set of states \(S\), a set of actions \(A\), a transition function \(T(s, a, s')\), a reward function \(R(s, a, s')\), a start state, and potentially a terminal state. One way to solve them is expectimax. For sequences of utilities, we can model a preference for utility *sooner* by exponentially discounting future utility by a discount factor \(\gamma\). If a game lasts forever, we can impose a finite horizon, use discounting, or add an absorbing state.

Define the value of a state \(V^*(s)\) as the expected utility starting in \(s\) and acting optimally, the value of a q-state \(Q^*(s, a)\) as the expected utility after having taken action \(a\) from state \(s\) and acting optimally thereafter, and the optimal policy \(\pi^*(s)\) as the optimal action from \(s\). The expectimax value of a state is: \(V^*(s) = \max_a Q^*(s, a) = \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]\) We can perform **value iteration** by applying this Bellman equation as an update until convergence, with complexity \(O(S^2 A)\) per iteration.

For a fixed policy \(\pi\), we can compute the utility of a state \(s\) as: \(V^\pi (s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V^\pi(s')]\) Each update costs \(O(S^2)\), but we could also just solve the system of linear equations directly. To extract a policy from values, we evaluate: \(\pi^*(s) = \arg \max_a \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]\) Value iteration is slow: the max over actions rarely changes, and the policy often converges long before the values do. **Policy iteration** alternates between **policy evaluation** and updating the policy with **policy extraction**.
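A compact sketch of value iteration and policy extraction, assuming a hypothetical MDP given as a list of `states`, an `actions(s)` function (non-empty for every state), a transition function `T(s, a)` returning a dict `{s': probability}`, and a reward function `R(s, a, s2)`:

```python
def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    """Bellman updates: V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T(s, a).items())
                    for a in actions(s))
             for s in states}
    return V

def extract_policy(states, actions, T, R, V, gamma=0.9):
    """Policy extraction: pick the action with the highest one-step lookahead value."""
    return {s: max(actions(s),
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                     for s2, p in T(s, a).items()))
            for s in states}
```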

These are all examples of **offline planning**, where quantities are determined through computation. **Reinforcement learning** involves receiving feedback in the form of rewards and learning to maximize expected rewards - we *don't know* the transition function or the reward function.

In **model-based learning**, we use repeated episodes to estimate the transition function and determine the rewards. An alternative is **model-free learning**. In **passive reinforcement learning**, we perform **direct evaluation** by averaging the sum of discounted rewards from each state across samples. **Temporal difference learning** places greater weight on recent samples: \(V^\pi (s) \leftarrow (1-\alpha) V^\pi(s) + \alpha [R(s, \pi(s), s') + \gamma V^\pi(s')]\)

But we can't extract a policy from TD value learning. To update our policy, we need **Q-learning**, which uses a similar update function: \(Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a')]\)
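A minimal Q-learning sketch, assuming a hypothetical list of `actions` and storing the Q-table in a `defaultdict` so unseen q-states start at 0; the epsilon-greedy chooser is one simple way to mix in occasional random exploratory moves:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s',a')]."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Mostly exploit the current Q-values, but explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)  # unseen q-states start at 0
```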

Surprisingly, Q-learning converges to the optimal policy *even if* you're acting suboptimally, a property called **off-policy learning**. To find the optimal policy, the agent has to balance exploration and exploitation. We could have the agent occasionally take random moves, but a better approach is to reward unvisited states: with estimated value \(u\) and visit count \(n\), use an optimistic utility \(f(u, n) = u + k/n\).

**Regret** measures total mistake cost, the difference between expected rewards and optimal rewards. Minimizing regret requires optimally learning to be optimal. In the real world, we'd like to generalize across similar states using a **feature-based representation**. The q function is expressed using weights: \(Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)\)

With linear Q-functions, updates adjust the weights of active features: \(w_i \leftarrow w_i + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)] f_i(s, a)\)
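A sketch of the approximate Q-learning update, assuming a hypothetical `features(s, a)` function that returns a dict of feature names to values:

```python
def approx_q_value(weights, features, s, a):
    """Q(s,a) = sum_i w_i * f_i(s,a) for a feature dict {name: value}."""
    return sum(weights.get(name, 0.0) * value for name, value in features(s, a).items())

def approx_q_update(weights, features, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    """w_i <- w_i + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * f_i(s,a)."""
    target = r + gamma * max(approx_q_value(weights, features, s_next, a2) for a2 in actions)
    difference = target - approx_q_value(weights, features, s, a)
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```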

Approximation works because it moves in the direction of the gradient that minimizes error. If we had only one point \(x\) with features \(f(x)\), a target value \(y\), and weights \(w\), then taking the derivative of the least-squares error justifies the linear Q-function update: \(e(w) = \frac{1}{2}(y - \sum_k w_k f_k(x))^2, \quad \frac{\partial e(w)}{\partial w_m} = -(y - \sum_k w_k f_k(x)) f_m(x)\)

Approximate Q-learning lets us generalize and avoid **overfitting**. Even if it doesn't model Q-values well, it can still be effective at maximizing rewards. **Policy search** involves starting with an OK solution and fine-tuning with hill-climbing. To hill-climb effectively, methods can exploit lookahead structure, sample wisely, and change multiple parameters at a time.

# Final

## Bayes Nets

### Probability

In an environment with uncertainty, we use **observed variables** (evidence) to build **models** that relate known variables to **unknown variables**. A **random variable** (RV) is an aspect of the world about which we may be uncertain; the variable is denoted with a capital letter and its value with a lower-case letter. A **probability distribution** gives the probability that a RV takes on each value. Unobserved RVs have distributions; we notate \(P(X = x) = p\). Distributions satisfy \(\forall x, P(X = x) \geq 0\) and \(\sum_x P(X = x) = 1\).

A **joint distribution** specifies a real number for each assignment with the same constraints. For \(n\) variables with domain sizes \(d\), the size of the distribution will be \(d^n\). A **probabilistic model** is a joint distribution over a set of RVs with domains, outcomes (assignment), and normalized probability (sums to 1). An event is a set \(E\) of outcomes with \(P(E) = \sum_{(x_1, ... , x_n) \in E} P(x_1, ..., x_n)\). **Marginal distributions** are sub-tables that eliminate variables by summing them out. A **conditional probability** is defined as \(P(a \| b) = P(a, b) / P(b)\). **Conditional distributions** are probability distributions over some variables given fixed values of other variables.

**Probabilistic inference** involves computing a probability from known probabilities. In the general case, you have evidence variables \(E_1, ..., E_k = e_1, ..., e_k\), a query variable \(Q\), and hidden variables \(H_1, ..., H_r\). We want to evaluate \(P(Q \| e_1, ..., e_k)\).

The **product rule** dictates that \(P(y) P(x \| y) = P(x, y)\). More generally, the chain rule dictates that any joint distribution is the product of conditional distributions. **Bayes rule** lets us build conditional probabilities as \(P(x \| y) = \frac{P(y \| x) P(x)}{P(y)}\). We can compute a **posterior distribution** by updating with Bayes rule.

### Independence

Models are always simplifications, but some models are useful. Bayes nets are a type of model that lets us reason about unknown variables given evidence. Two variables are **independent** if \(\forall x, y : P(x, y) = P(x) P(y)\), denoted \(X \indep Y\). Independence is a simplifying modeling assumption, since it makes computing joints easier. A RV \(X\) is **conditionally independent** of \(Y\) given \(Z\) if \(\forall x, y, z: P(x \| z, y) = P(x \| z)\) (equivalently, \(P(x, y \| z) = P(x \| z) P(y \| z)\)), denoted \(X \indep Y \| Z\).

Full joint distributions grow exponentially in size. **Bayes nets** (graphical models, "BN") describe complex joint distributions with simple, local distributions. A set of **nodes**, each representing a variable \(X\), comprises a **directed, acyclic graph**. Each node carries a conditional distribution over its variable for each combination of its parents' values. A BN encodes the joint distribution as the product of the local conditional distributions.

For a graph like \(X \to Y \to Z \to W\), we have \(P(x, y, z, w) = P(x)P(y\|x) P(z\|y) P(w \|z)\). An unintuitive independence relationship is \(W \indep X \| Y\). We can prove independence with tedious algebra, but disprove it with a single counterexample.
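For instance, a tiny sketch of this factorization with binary variables and made-up CPT numbers; `joint` just multiplies the four local tables:

```python
from itertools import product

# CPTs for the chain X -> Y -> Z -> W (binary variables, hypothetical numbers).
P_x = {True: 0.6, False: 0.4}
P_y_given_x = {(True, True): 0.7, (False, True): 0.3,    # keyed by (y, x)
               (True, False): 0.2, (False, False): 0.8}
P_z_given_y = {(True, True): 0.9, (False, True): 0.1,    # keyed by (z, y)
               (True, False): 0.5, (False, False): 0.5}
P_w_given_z = {(True, True): 0.4, (False, True): 0.6,    # keyed by (w, z)
               (True, False): 0.1, (False, False): 0.9}

def joint(x, y, z, w):
    """P(x, y, z, w) = P(x) P(y|x) P(z|y) P(w|z)."""
    return P_x[x] * P_y_given_x[(y, x)] * P_z_given_y[(z, y)] * P_w_given_z[(w, z)]

# The 16 joint entries sum to 1, as a full joint distribution must.
assert abs(sum(joint(*a) for a in product([True, False], repeat=4)) - 1.0) < 1e-9
```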

A **triple** is a group of 3 nodes. **D-separation** is an algorithm for analyzing the independence of triples. In a **causal chain** \(X \to Y \to Z\), \(X \indep Z \| Y\). In **common cause**, \(X \leftarrow Y \to Z\), \(X \indep Z \| Y\). In **common effect**, \(X \to Z \leftarrow Y\), \(X \indep Y\), but conditioning on \(Z\) (or its descendants) activates the triple.

Any BN can be broken down into the three cases. Two variables are independent if there are no active **paths** between them. A path is active if every component triple is active.

Given a BN structure, the **d-separation algorithm** builds a complete list of conditional independences. The topology of a BN limits which joint distributions can be encoded.

### Inference

Inference in a BN is NP-complete; 3-SAT can be encoded as a BN inference problem. Some orderings make computation more efficient. A **polytree** is a directed graph with no undirected cycles.

**General variable elimination** eliminates hidden variables one at a time. You instantiate the CPTs with the evidence, find all factors containing a given hidden variable \(H\), join them, and sum \(H\) out. Ordering matters: if you eliminate a variable with many dependents, the maximum factor will be much larger.

The intermediate probability tables are called **factors**. In **inference by enumeration**, we start with the CPTs as the initial factors. We select the factors consistent with the evidence, eliminate the hidden variables, and then normalize. Marginalizing early creates savings. Always start with factors that select the evidence.

The first type of factor is a **joint distribution**: there's one entry for each joint assignment. If one variable has taken on a value, say a fixed \(X = x\), then the table sums to \(P(x)\). The number of capitalized (unassigned) variables is the dimensionality, and the table size is the product of their domain sizes.

The second type of factor is a single conditional or a family of conditionals. It could be \(P(Y \| x)\) for a fixed \(x\), or a family of conditionals \(P(Y \| X)\).

The final type of factor is the **specified family**: we fix \(Y = y\) and get the table \(P(y \| X)\).

In general, for a factor \(P(Y_1 ... Y_N \| X_1 ... X_M)\), any assigned variable is a dimension missing from the array.

### Sampling

Variable elimination works by joining the factors that mention a hidden variable and marginalizing (summing) the hidden variable out. In the worst case, it's still exponential in the size of the Bayes' net.

Sampling is **repeated simulation**. Instead of calculating the model directly, we quickly ask "what if" questions. In **prior sampling**, we sample from the priors and ultimately generate the distribution. As we take more samples, the estimate approaches the real probability; it is **consistent**. For **rejection sampling**, we reject samples that are inconsistent with the evidence. The problem is that this is inefficient; instead we can use **likelihood weighting**, weighting the entire sample by the probability of the evidence given its parents. But likelihood weighting can't fix upstream variables to make the evidence more likely. **Gibbs sampling** addresses this: start with an arbitrary instantiation, then resample one variable at a time conditioned on the rest, keeping the evidence fixed.
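A minimal likelihood-weighting sketch over a hypothetical BN with binary variables, where `cpt[X]` maps a tuple of parent values to \(P(X = \text{true} \mid \text{parents})\) and `topo_order`/`parents` describe the graph:

```python
import random

def likelihood_weighting_sample(topo_order, parents, cpt, evidence):
    """One weighted sample: sample non-evidence variables from their CPTs, and
    multiply the weight by P(e | parents) for each evidence variable."""
    sample, weight = {}, 1.0
    for X in topo_order:
        parent_vals = tuple(sample[p] for p in parents[X])
        p_true = cpt[X][parent_vals]          # P(X = True | parents)
        if X in evidence:
            value = evidence[X]
            weight *= p_true if value else (1 - p_true)
        else:
            value = random.random() < p_true
        sample[X] = value
    return sample, weight

def query(topo_order, parents, cpt, evidence, query_var, n=10000):
    """Estimate P(query_var = True | evidence) from weighted samples."""
    num = den = 0.0
    for _ in range(n):
        s, w = likelihood_weighting_sample(topo_order, parents, cpt, evidence)
        den += w
        if s[query_var]:
            num += w
    return num / den
```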

### Markov Models

The value of \(X\) at some time is the state. The parameters are the transition probabilities/dynamics, which specify the probability of transitions.

The first-order Markov property is that the past and future are independent given the present. The chain is just a BN. A mini-forward algorithm could compute: \(P(x_t) = \sum_{x_{t-1}} P(x_t \| x_{t-1}) P(x_{t-1})\) For most chains, the initial distribution matters less over time, and we end up in a **stationary distribution** \(P_\infty\) which satisfies \(P_\infty(X) = P_{\infty + 1}(X)\).

Hidden Markov models add the property that the current observation is independent of everything else given the current state. We update our beliefs by **filtering**, tracking the distribution \(B_t(X) = P(X_t \| e_1, ..., e_t)\). Usually \(B_1\) is uniform, since we don't have evidence yet. To account for the passage of time, we update with \(B'(X_{t+1}) = \sum_{x_t} P(X_{t+1} \| x_t) B(x_t)\). When we receive new evidence, we update with \(B(X_{t+1}) \propto P(e_{t+1} \| X_{t+1}) B'(X_{t+1})\).

The **forward algorithm** lets us calculate \(B_t(X)\) given the evidence at each time step. We can compute: \(P(x_t \| e_{1:t}) \propto P(e_t \| x_t) \sum_{x_{t-1}} P(x_t \| x_{t-1}) P(x_{t-1} \| e_{1:t-1})\) **Particle filtering** gives an approximate solution in scenarios where the state space is too big. We track samples called particles, representing \(P(X)\) with a list of \(N\) particles where \(N \ll \|X\|\). Each particle is moved by sampling the transition model, \(x' = \text{sample}(P(X'\|x))\), similar to prior sampling. After making an observation \(e\), we weight the samples according to the probability of that observation. To resample, we draw a new set of particles from the previous, weighted distribution of particles.
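A sketch of one particle-filtering time step, assuming hypothetical `transition_sample(x)` (draws \(x' \sim P(X' \mid x)\)) and `emission_prob(e, x)` (returns \(P(e \mid x)\)) callbacks:

```python
import random

def particle_filter_step(particles, evidence, transition_sample, emission_prob):
    """One time step: elapse time (sample the transition model), weight by the
    observation, then resample N particles from the weighted distribution."""
    # Elapse time: x' ~ P(X' | x) for each particle.
    moved = [transition_sample(x) for x in particles]
    # Observe: weight each particle by P(e | x').
    weights = [emission_prob(evidence, x) for x in moved]
    if sum(weights) == 0:            # all particles inconsistent with the evidence
        return [random.choice(moved) for _ in moved]
    # Resample: draw N new particles proportional to their weights.
    return random.choices(moved, weights=weights, k=len(particles))
```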

### Decision Networks

A **decision network** lets us calculate expected utility for actions under uncertainty. To select an action, we instantiate all evidence, select all possible actions, calculate the posterior for all parents of a utility node, and pick the action that maximizes expected utility. A decision tree can be modeled as an outcome tree.

The **value of perfect information** is the gain in MEU from learning the information: \(VPI(E'\|e) = (\sum_{e'} P(e'\|e) MEU(e, e')) - MEU(e)\) VPI is **nonnegative** (information won't hurt), **nonadditive** (the VPI of two observations isn't necessarily the sum of their individual VPIs), and **order-independent** (the order of observations doesn't matter). There's no "imperfect" information - if an observation is noisy, we can model it with another node.

**POMDPs** add a noisy observation \(O\) of the true state, modeled by an observation function \(P(o \| s)\). One approximate way to act in a POMDP is a VPI-based agent that repeatedly calculates the marginal value of information.

## Machine Learning

### Naïve Bayes

Examples of classification problems include spam/ham filtering, digit recognition, and a variety of commercial applications. In **Naïve Bayes**, we assume all features are independent effects of the label, so the model only specifies how each feature depends on the class: \(P(Y\|F_{1} ... F_{n}) \propto P(Y) \prod_{i} P(F_{i} \| Y)\) To use NB, we need an inference method and local conditional probability tables, which are specified by parameters denoted \(\theta\).

The experimental cycle involves learning parameters on the training set, tuning hyperparameters on held-out data, and computing accuracy on the test set. Accuracy is the fraction of instances predicted correctly. **Overfitting** refers to fitting a model to the training data without generalizing to future cases. To generalize, estimates need to be smoothed and regularized.

To estimate parameters, we use the empirical rate of each value, picking the estimate that maximizes the likelihood of the data: \(P_{ML}(x) = \frac{\text{count}(x)}{\text{total samples}}, L(x, \theta) = \prod_i P_\theta(x_i)\) In maximum likelihood estimation, the hypothesis space is the set of binomial distributions. Taking the derivative of the likelihood shows that the empirical rate maximizes likelihood.

**Smoothing** gives us a way to deal with unseen events and avoid overfitting. In Laplace smoothing, we pretend that we've seen every outcome \(k\) extra times: \(P_{LAP, k}(x) = \frac{c(x) + k}{N + k|X|}\) The amount of smoothing is a hyperparameter that we tune after finding parameters. A **baseline** is a straw man procedure that measures how hard the task is and gives a comparison for other accuracy achievements. The **confidence** of a classifier represents how sure the classifier is in the classification.
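A quick sketch of the Laplace estimate (the coin counts below are made up):

```python
from collections import Counter

def laplace_estimate(samples, domain, k=1):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |X|)."""
    counts = Counter(samples)
    N = len(samples)
    return {x: (counts[x] + k) / (N + k * len(domain)) for x in domain}

# Hypothetical data: 3 heads, 1 tail.
print(laplace_estimate(["H", "H", "H", "T"], domain=["H", "T"], k=1))
# -> {'H': 0.666..., 'T': 0.333...} instead of the unsmoothed 0.75 / 0.25
```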

### Perceptrons

In a **linear classifier**, the inputs are feature values, each feature has a weight, and the sum is the activation: \(\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)\) With a binary decision rule, the model outputs +1 when the activation is positive and -1 otherwise. The **decision boundary** is a hyperplane orthogonal to the weight vector. To learn a binary perceptron, we adjust the weights only when a prediction is incorrect: \(w = w + y^* \cdot f\)
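A minimal binary perceptron sketch; the data at the bottom are made-up, linearly separable examples with a constant bias feature:

```python
def perceptron_train(data, num_features, passes=10):
    """Binary perceptron: on a mistake, add y* . f to the weights (y* in {+1, -1})."""
    w = [0.0] * num_features
    for _ in range(passes):
        for f, y_star in data:                       # f: feature list, y_star: true label
            activation = sum(wi * fi for wi, fi in zip(w, f))
            prediction = 1 if activation > 0 else -1
            if prediction != y_star:                 # only update on mistakes
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
    return w

# Hypothetical separable data: the label is the sign of the second feature.
data = [([1.0, 2.0], 1), ([1.0, -3.0], -1), ([1.0, 0.5], 1), ([1.0, -1.0], -1)]
print(perceptron_train(data, num_features=2))
```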

If we have multiple classes, we introduce a weight vector \(w_y\) for each class and pick the prediction with the highest score: \(y = \arg \max_y w_y \cdot f(x)\) We train by subtracting weights from the wrong answer and adding weights to the right answer.

**Separability** refers to the existence of parameters that get the training set entirely correct; in this case, the perceptron converges. The **mistake bound** relates the number of mistakes to the margin \(\delta\) and the number of features \(k\): \(\text{mistakes} < \frac{k}{\delta^2}\) If the data aren't separable, the weights might thrash; averaging weights over time can help. Even when it converges, the perceptron may generalize poorly by barely separating the data, which leads to overfitting. We can get probabilistic decision boundaries from the scores: if the score is very positive, we want the probability to be close to 1, and vice versa. The **sigmoid** function models this: \(z = w \cdot f(x), \phi (z) = \frac{1}{1+ e ^{-z}}\) The new learning method is maximum likelihood estimation: \(\max _w ll(w) = \max_w \sum_i \log P(y^{(i)} \| x^{(i)} ; w), P(y^{(i)} = +1 \| x^{(i)} ; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}\) Multiclass logistic regression evaluates scores in the same way.

### Neural Networks

The simplest way to optimize weights is **hill climbing**: move to the best neighbor until no neighbor is better. For multiclass logistic regression, the space is continuous with infinitely many neighbors. Instead of randomly picking neighbors, **gradient ascent** updates in the uphill direction. For \(g(w_1, w_2, ...)\), we update: \(w \leftarrow w + \alpha \nabla_w g(w), \quad \nabla_w g(w) = [\frac{\partial g}{\partial w_1}(w), \frac{\partial g}{\partial w_2}(w), ...]\) We can use the log-likelihood of logistic regression as \(g\), in which case each update adds a multiple of \(f\) to the correct class weight.
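A sketch of batch gradient ascent on the logistic-regression log-likelihood for labels in {+1, -1}; the data format (a list of `(feature_list, label)` pairs) is an assumption:

```python
import math

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)        # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gradient_ascent(data, num_features, alpha=0.1, iterations=200):
    """Maximize ll(w) = sum_i log P(y_i | x_i; w) for labels y in {+1, -1}.
    For one example the gradient is (1 - P(y_i | x_i; w)) * y_i * f(x_i)."""
    w = [0.0] * num_features
    for _ in range(iterations):
        grad = [0.0] * num_features
        for f, y in data:
            p = sigmoid(y * sum(wi * fi for wi, fi in zip(w, f)))   # P(y | x; w)
            grad = [g + (1 - p) * y * fi for g, fi in zip(grad, f)]
        w = [wi + alpha * gi for wi, gi in zip(w, grad)]            # step uphill
    return w

# Same toy data as the perceptron sketch: the label is the sign of the second feature.
data = [([1.0, 2.0], 1), ([1.0, -3.0], -1), ([1.0, 0.5], 1), ([1.0, -1.0], -1)]
print(logistic_gradient_ascent(data, num_features=2))
```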

One alternative is **stochastic gradient ascent**, which computes the gradient on random individual samples. Another is **mini-batch**, where the gradient over a small batch is computed in parallel.

Multi-class logistic regression is a special case of **neural networks**; the difference is that a deep neural network learns the features. The value of node \(i\) in layer \(k\) is a nonlinear function of a weighted sum of the previous layer's neurons: \(z_i^{(k)} = g(\sum_j W_{i, j}^{(k-1, k)} z_j^{(k-1)})\) Common activation functions are the sigmoid, the hyperbolic tangent (\(\frac{e^z - e^{-z}}{e^z + e^{-z}}\)), and ReLU (\(\max(0, z)\)). A two-layer neural network with sufficiently many neurons can closely approximate any continuous function.
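A sketch of a two-layer forward pass with made-up weights, matching the layer equation above (ReLU hidden layer, sigmoid output):

```python
import math

def relu(z):
    return max(0.0, z)

def layer(values, weights, activation):
    """z_i^(k) = g(sum_j W_ij * z_j^(k-1)) for one layer with weight matrix W (rows = units)."""
    return [activation(sum(w_ij * v_j for w_ij, v_j in zip(row, values)))
            for row in weights]

def forward(x, W1, W2):
    """Two-layer network: ReLU hidden layer, then a sigmoid output unit."""
    hidden = layer(x, W1, relu)
    return layer(hidden, W2, lambda z: 1.0 / (1.0 + math.exp(-z)))

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
W2 = [[0.7, -0.5, 0.2]]
print(forward([1.0, 2.0], W1, W2))
```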

**Backpropagation** computes the partial derivatives backwards through the network. It gives the exact gradient, in contrast to **finite difference approximation**. Applications of neural nets include computer vision, visual QA, and machine translation. Neural nets are limited by shortcut learning, relying on correlated information and particular features. In deep RL, coordination tasks can overfit and be sensitive to noise in collaborators.