Wiki has numerous data examples. A data example typically comprises row captions and sample data in SDTM
Biomedical concepts are a formalism of clinical knowledge
Hypothesis: We can use the Wiki data examples to automate the creation of biomedical concepts
Key phrases: mining algorithms; turn data into knowledge
NLP
CDISC products, specifically TAUG, are authored by SMEs in the biopharma sector. Acronyms are prevalent in the sector. Because of that, particular linguistic patterns are apparent.
Learning health systems
?
Characteristics of ML
Patience: Human vs. machine ability to learn without being explicitly programmed
Tom Mitchell: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. For example:
E: Process scenarios of clinical data collection, aggregation, and analysis
T: Map a scenario to standardized structures
P: Increased accuracy of data mapping
Learning algorithm categories
Supervised learning, where "right answers" are given
Unsupervised learning, for finding structure of data
Reinforcement learning
Recommender system
ML problem categories
Supervised learning
Regression to predict real-valued output
Classification predicts discrete-valued output from features (attributes), e.g., age × tumor size, tumor cell morphology (shape, uniformity, etc.)
Unsupervised learning
Clustering looks for concentrations of unlabeled (non-attributed) data points, hence clusters or segmentation
Cocktail party problem, an example of using the right tool for complex algorithmic tasks, e.g., a one-liner in Octave <https://www.gnu.org/software/octave> for separating two conversations mixed in one soundtrack: [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1), 1).*x)*x');
Linear regression
A form of supervised learning where the "right answers" are provided
Requires training set of data, where
m: number of training examples
x: input variable, or features
y: output variable, or target
(x, y): a single training example
(x(i), y(i)): the ith training example
Training set > Learning algorithm > Hypothesis h
h is simply a function that takes x as input and produces the predicted y as output
The term hypothesis was coined in the early days of AI and stuck, becoming a general convention
Linear regression function with one variable (univariate linear regression)
hΘ(x) = Θ0 + Θ1x
h maps from x to y
Θi's denote parameters
In other words, it is another representation of the linear equation from algebra, y = mx + b, where m is the slope of the line and b is the y-intercept
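As a minimal sketch (the function name h and the example values Θ0 = 1, Θ1 = 2 are made up for illustration, not learned from data), the hypothesis is just a function of x:

```python
# Hypothesis for univariate linear regression: h(x) = theta0 + theta1 * x.
# theta0 and theta1 are normally learned from data; here they are hand-picked.
def h(x, theta0, theta1):
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2, h behaves like y = 2x + 1:
print(h(3, 1, 2))  # 7
```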
Strategy for picking the coefficients Θ0 and Θ1
Meaning, they are chosen so that hΘ(x) is close to y for the training examples (x, y)
To do that, use cost function J(Θ0, Θ1). Explanation from Beginner: Cost Function and Gradient Descent: the cost function measures how far the predicted line is from the actual points that were already given. In other words, given some data points, pick values for Θ0 and Θ1 and draw the resulting line on the graph; the line will not exactly touch all of the given points, so calculate how far away the original points are from the predicted line. That distance is what the cost function computes. The formula is as follows:
J(Θ0, Θ1) = (1/2m) Σ(i=1..m) (hΘ(x(i)) - y(i))²
Breaking the formula down:
0 means no cost, i.e., a hypothesis that predicts 100% correctly
Therefore, the objective is to minimize the cost, hence hΘ(x) - y should be as small as possible. In other words, the cost measures the difference between the prediction from the hypothesis hΘ(x) and the actual y. Reminder: hΘ(x) refers to Θ0 + Θ1x
The 2nd power is referred to as the "squared error"
Other explanation: "Plot the function y = x^2. It's a parabola that opens upwards. That means there is a definite minimum value. For linear regression (and correlation), we have the same thing, except the formula is slightly more elaborate, but it basically boils down to a parabola opening upwards, and that minimum point is the estimate for the slope and y-intercept that we derive."
The summation i=1..m is to account for all training examples given in a set, hence the formula sums (hΘ(x(i)) - y(i))² over i = 1..m and divides by 2m
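The cost function can be sketched in a few lines of Python; the data set (a perfect y = 2x fit) and the function name cost are illustrative only:

```python
# Squared-error cost: J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# A hypothesis that predicts 100% correctly yields zero cost:
xs, ys = [1, 2, 3], [2, 4, 6]      # y = 2x exactly
print(cost(0, 2, xs, ys))          # 0.0
```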
To approximate the slope of the regression line using the (x, y) training set, divide the covariance of x and y by the variance of x: slope = cov(x, y) / var(x)
A negative covariance means variable X will increase as Y decreases, and vice versa, while a positive covariance means that X and Y will increase or decrease together.
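The covariance/variance relationship gives a closed-form estimate of the line; a sketch in plain Python, with made-up data (the name fit_line is illustrative):

```python
def fit_line(xs, ys):
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return my - slope * mx, slope   # (theta0, theta1)

print(fit_line([1, 2, 3], [3, 5, 7]))  # (1.0, 2.0): y = 2x + 1
```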
The center of a contour plot represents an optimal cost of hΘ(x), i.e., the most minimized cost.
Gradient descent for some cost function J(Θ0, Θ1) to find its minimum cost. It is an iterative algorithm.
Start with some Θ0, Θ1, often zeros
Keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) until a minimum is reached
With the partial derivatives applied, the updates become:
Θ0 := Θ0 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i))
Θ1 := Θ1 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · x(i)
α denotes learning rate
It is important to simultaneously update Θ0, Θ1 when implementing gradient descent, such that both new values are computed from the current values before either is overwritten:
temp0 := Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
temp1 := Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ0 := temp0
Θ1 := temp1
Recall Sal Khan mentioned the error curve is a bowl shape (a.k.a. a convex function); the gradient descent algorithm pushes the parameters toward the minimum point, regardless of which side of the curve the initial Θ0, Θ1 values fall on.
At the global optimum, the derivative (slope) will be 0
The term "batch gradient descent" refers to when the algorithm uses the entire training set (x, y) at every step. There are other versions of gradient descent that use only a subset.
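Batch gradient descent for the univariate case can be sketched as follows; the hyperparameters (alpha = 0.1, 1000 iterations) and the toy data set are arbitrary choices for illustration, not recommendations:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    # Batch gradient descent: every step uses the full training set,
    # and theta0/theta1 are updated simultaneously.
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

# Data generated from y = 2x + 1, so the parameters should converge there:
t0, t1 = gradient_descent([1, 2, 3], [3, 5, 7])
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```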
When we need to predict using multiple parameters Θ0, Θ1, ..., Θn, use matrices and vectors, such that:
All the inputs are arranged in a matrix X, where each row is a training example and each column corresponds to one parameter
All the predictions y are arranged in a vector
For hΘ(x) = Θ0 + Θ1x, we can use matrix manipulation to make prediction a computationally efficient task: arrange each example as a row [1, x] of X, arrange the parameters as a column vector [Θ0; Θ1], and compute all predictions at once as X · [Θ0; Θ1]
Further, we can combine multiple hypotheses into the matrix manipulation:
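A rough illustration of combining multiple hypotheses into one matrix product, using a plain-Python matrix multiply as a stand-in for an optimized library routine; the two hypotheses and the data are made up:

```python
def matmul(A, B):
    # Plain-Python matrix product (row of A dotted with each column of B).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Design matrix: one row per example, with a leading 1 for the intercept.
X = [[1, 1],
     [1, 2],
     [1, 3]]
# Two competing hypotheses, one per column: h_a = 0 + 2x, h_b = 1 + 1x.
Theta = [[0, 1],
         [2, 1]]
print(matmul(X, Theta))  # [[2, 2], [4, 3], [6, 4]]
```

Each column of the result holds one hypothesis's predictions for all three examples.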
Multivariate linear regression
Multiple features
x1, x2, ..., xn: input variable, or features
xj(i) denotes value of feature j in ith training example
Hypothesis is the inner product between the parameter vector theta and feature vector x: hΘ(x) = Θᵀx = Θ0x0 + Θ1x1 + ... + Θnxn
The gradient descent algorithm becomes:
Θj := Θj - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · xj(i), simultaneously updated for j = 0, ..., n
Recall x0 = 1, therefore x0(i) is also 1
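A sketch of the multivariate update in plain Python, assuming each row of X already carries the leading x0 = 1; the function name gd_multi and the toy data are illustrative:

```python
def gd_multi(X, y, alpha=0.1, iters=2000):
    # Multivariate batch gradient descent; X rows already include x0 = 1.
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        # Errors are computed once per pass, so all theta_j update simultaneously.
        errs = [sum(t * xj for t, xj in zip(theta, row)) - yi
                for row, yi in zip(X, y)]
        theta = [t - alpha * sum(e * row[j] for e, row in zip(errs, X)) / m
                 for j, t in enumerate(theta)]
    return theta

X = [[1, 0], [1, 1], [1, 2]]
y = [1, 3, 5]                                  # y = 1 + 2*x1
print([round(t, 3) for t in gd_multi(X, y)])   # [1.0, 2.0]
```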
Feature scaling is a trick to optimize the gradient descent algorithm so that it takes fewer steps to find the global minimum. This is especially useful when the features have widely different ranges. In that case, contours with high eccentricity (tall, skinny ellipses; or short, wide ellipses) occur, causing many zigzag movements before converging to the global minimum.
Ideally, -1 ≤ xi ≤ 1, or close enough to -1 or 1, such as 0 ≤ xi ≤ 3 or -2 ≤ xi ≤ 0.5. A very small range such as -0.001 ≤ xi ≤ 0.001 is likewise considered poorly scaled.
Further, mean normalization is defined as xi := (xi - μi) / si, where μi is the mean of feature i over the training set and si is either the range (max - min) or the standard deviation
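Mean normalization can be sketched as follows, taking s as the range (max - min); the standard deviation is a common alternative, and the data values are made up:

```python
def mean_normalize(xs):
    # x_i := (x_i - mu) / s, with s = range (max - min) here.
    mu = sum(xs) / len(xs)
    s = max(xs) - min(xs)
    return [(x - mu) / s for x in xs]

# A wide-ranging feature is rescaled to roughly [-0.5, 0.5]:
print(mean_normalize([100, 200, 300]))  # [-0.5, 0.0, 0.5]
```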
Choosing a good initial learning rate α may significantly reduce the # of iterations needed to converge to the global optimum. A good candidate learning rate α would yield an exponential decay when plotting J(Θ) (y-axis) as a function of the # of iterations (x-axis). Conversely, a curve showing exponential growth or a wave form (sinusoidal function) would indicate a non-optimal learning rate α because it is diverging from the global optimum.
A strategy is to try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, stepping up by a factor of roughly 3 between 0.001 and 1.
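The sweep can be sketched by running a fixed number of gradient-descent steps per candidate rate and comparing the resulting costs (toy data; the helper name run_gd is made up). A rate that is too small barely reduces the cost, while a rate that is too large makes it blow up:

```python
def run_gd(xs, ys, alpha, iters=50):
    # A few steps of univariate gradient descent; returns the final cost.
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        t0 -= alpha * sum(errs) / m
        t1 -= alpha * sum(e * x for e, x in zip(errs, xs)) / m
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]:
    print(alpha, run_gd(xs, ys, alpha))
```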
Polynomial regression
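Polynomial regression can be treated as linear regression over expanded features; a minimal sketch with hand-picked (not learned) parameters, where poly_features and predict are illustrative names:

```python
# Treat x and x^2 as two features, so h(x) = theta0 + theta1*x + theta2*x^2
# is still linear in the parameters theta.
def poly_features(x):
    return [1.0, x, x * x]

def predict(theta, x):
    return sum(t * f for t, f in zip(theta, poly_features(x)))

# With theta = [1, 0, 1], h(x) = 1 + x^2:
print(predict([1, 0, 1], 3))  # 10.0
```

Because x and x² have very different ranges, feature scaling matters even more here.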
Questions
Should we have a go at ML by ourselves? It would be great to have external advisers give us suggestions at various checkpoints to minimize wasted time, such as framing a well-posed learning problem, selecting the right approach, following best practices, etc.