• Biomedical concept creation
    • Database mining
      • The Wiki has numerous data examples. A data example typically consists of row captions and sample data in SDTM
      • Biomedical concepts are a formalism of clinical knowledge
      • Hypothesis: We can use the Wiki data examples to automate the creation of biomedical concepts
      • Key phrases: mining algorithms; turn data into knowledge
    • NLP
      • CDISC products, specifically the TAUGs, are authored by SMEs in the biopharma sector. Acronyms are prevalent in the sector; because of that, particular linguistic patterns are apparent.
  • Learning health systems
    • ?
  • Characteristics of ML
    • Patience: Human vs. machine ability to learn without being explicitly programmed
    • Tom Mitchell: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. For example:
      • E: Process scenarios of clinical data collection, aggregation, and analysis
      • T: Map a scenario to standardized structures
      • P: Increased accuracy of data mapping
    • Learning algorithm categories
      • Supervised learning, where "right answers" are given
      • Unsupervised learning, for finding structure of data
      • Reinforcement learning
      • Recommender system
    • ML problem categories
      • Supervised learning
        • Regression to predict real-valued output
        • Classification is for discrete-valued output by features (attributes), e.g., age X tumor size, tumor cell morphology (shape, uniformity, etc.)
      • Unsupervised learning
        • Clustering looks for concentrations of unlabeled (non-attributed) data points, yielding clusters or segments
        • Cocktail party problem, an example of using the right tool for a complex algorithmic task, e.g., a one-liner in Octave <https://www.gnu.org/software/octave> for separating two conversations mixed in one soundtrack: [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1), 1).*x)*x');
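        • The Octave one-liner can be sketched in NumPy as follows; the 2-microphone mixture x is a random stand-in, not data from the course:

```python
import numpy as np

# Hypothetical 2-microphone mixture: 2 x 1000 matrix of samples (assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1000))

# NumPy rendering of [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1), 1).*x)*x');
col_energy = np.sum(x * x, axis=0)                   # sum(x.*x,1): per-column energy
weighted = np.tile(col_energy, (x.shape[0], 1)) * x  # repmat(...) .* x
W, s, Vt = np.linalg.svd(weighted @ x.T)             # svd((...) * x')
```

        The columns of W give the unmixing directions; applying W.T to the mixture separates the sources.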
    • Linear regression
      • A form of supervised learning where the "right answers" are provided
      • Requires training set of data, where
        • m: number of training examples
        • x: input variable, or features
        • y: output variable, or target
        • (x, y): a single training example
        • (x(i), y(i)): the ith training example
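        • A minimal sketch of this notation on a hypothetical toy training set (the housing-style numbers are an assumption, not from the notes):

```python
# Toy training set: x = living area, y = price (hypothetical values).
xs = [2104, 1416, 1534, 852]   # x: input variable, or features
ys = [460, 232, 315, 178]      # y: output variable, or target

m = len(xs)                    # m: number of training examples
x_1, y_1 = xs[0], ys[0]        # (x(1), y(1)): the 1st training example
```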
      • Training set > Learning algorithm > Hypothesis h
        • h is simply a function that takes x as input and produces y as output
          • The term hypothesis was coined in the early days of AI and it stuck to become a general convention
        • Linear regression function with one variable (univariate linear regression)
          • hΘ(x) = Θ0 + Θ1x
          • h maps from x to y
          • Θi's denote parameters
          • In other words, it is another representation of the linear equation from algebra, y = mx + b, where m is the slope of the line and b is the y-intercept
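          • As a minimal sketch, the univariate hypothesis is just a line evaluated at x (the parameter values below are arbitrary):

```python
# h_theta(x) = theta0 + theta1 * x, i.e. y = m*x + b with b = theta0, m = theta1.
def h(theta0, theta1, x):
    return theta0 + theta1 * x

print(h(2.0, 0.5, 4.0))  # → 4.0
```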
          • Strategy for picking values for Θ0 and Θ1
            • Meaning, they are chosen so that hΘ(x) is close to y for the training examples (x, y)
            • To do that, use the cost function J(Θ0, Θ1). Paraphrasing the explanation from Beginner: Cost Function and Gradient Descent:
              The cost function measures how far your predicted line is from the actual points you were given. You had some data points; you then picked some values of Θ0 and Θ1 and drew the corresponding line on the graph. That line does not exactly touch all of the data points, so you calculate how far the original points are from the predicted line, and that distance is what the cost function computes:

              J(Θ0, Θ1) = (1/2m) Σ(i=1..m) (hΘ(x(i)) - y(i))²
            • Breaking the formula down:
              • 0 means no cost, i.e., a hypothesis that predicts 100% correctly
              • Therefore, the objective is to minimize the cost, hence hΘ(x) - y should be as small as possible. In other words, the cost measures the discrepancy between the prediction hΘ(x) and the actual y. Reminder: hΘ(x) refers to Θ0 + Θ1x
              • The 2nd power is why these are referred to as "squared errors"
                • Other explanation: "Plot the function y = x^2. It's a parabola that opens upwards. That means there is a definite minimum value. For linear regression (and correlation), we have the same thing, except the formula is slightly more elaborate, but it basically boils down to a parabola opening upwards, and that minimum point is the estimate for the slope and y-intercept that we derive."
              • The summation i=1..m is to account for all training examples given in a set, hence the formula becomes Σ(i=1..m) (hΘ(x(i)) - y(i))²
              • ? What is the purpose of the constant 1/(2m) in the formula? Dividing by m averages the squared errors over the training set, and the extra ½ cancels the factor of 2 produced when differentiating the square. An explanation: https://math.stackexchange.com/questions/884887/why-divide-by-2m
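              • Putting the pieces together, the squared-error cost can be sketched in plain Python (the toy data is an assumption):

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum of (h(x(i)) - y(i))^2 over the training set."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A hypothesis that fits the data exactly has zero cost:
print(cost(0.0, 2.0, [1, 2, 3], [2, 4, 6]))  # → 0.0
```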
            • There are other diagnostics to determine linear fit, e.g., AIC, BIC, R-squared, adjusted R-squared
            • To approximate the slope of the regression line using the (x, y) training set:

              slope = Σ(i=1..m) (x(i) - x̄)(y(i) - ȳ) / Σ(i=1..m) (x(i) - x̄)² = Cov(x, y) / Var(x), and intercept = ȳ - slope · x̄, where x̄ and ȳ are the means of x and y

              • A negative covariance means variable X will increase as Y decreases, and vice versa, while a positive covariance means that X and Y will increase or decrease together.
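              • A sketch of that covariance-based slope estimate in plain Python (the toy data is an assumption):

```python
def slope_intercept(xs, ys):
    """Least-squares slope = Cov(x, y) / Var(x); intercept from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

print(slope_intercept([1, 2, 3], [2, 4, 6]))  # → (2.0, 0.0)
```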
          • Contour plots: given a 3D surface, slice it with planes of constant z at regular intervals; each slice yields a contour line. https://www.khanacademy.org/math/multivariable-calculus/thinking-about-multivariable-function/visualizing-scalar-valued-functions/v/contour-plots
            • The center of the innermost contour represents the optimal cost of hΘ(x), i.e., the most minimized cost.
          • Gradient descent is an iterative algorithm for finding the minimum of some cost function J(Θ0, Θ1)
            • Start with some Θ0, Θ1, often zeros
            • Keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) until a minimum is reached:

              Θj := Θj - α ∂/∂Θj J(Θ0, Θ1), simultaneously for j = 0 and j = 1

              With the partial derivatives applied:

              Θ0 := Θ0 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i))
              Θ1 := Θ1 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · x(i)
            • α denotes the learning rate
            • It is important to simultaneously update Θ0, Θ1 when implementing gradient descent: compute temp0 and temp1 from the current values of both parameters, then assign Θ0 := temp0 and Θ1 := temp1
            • Recall Sal Khan mentioned the error curve is bowl-shaped (a.k.a. a convex function); the gradient descent algorithm pushes the parameters toward the minimum point, regardless of which side of the curve the initial Θ0, Θ1 values fall on.
            • The term "batch gradient descent" refers to when the algorithm uses the entire training set (x, y). There are other versions of gradient descent that only references a subset.
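            • The steps above can be sketched as batch gradient descent in plain Python; the toy data, α = 0.1, and the iteration count are assumptions:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0 = theta1 = 0.0                        # start with zeros
    for _ in range(iters):
        # Simultaneous update: both gradients use the current parameters.
        err = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(err) / m
        grad1 = sum(e * x for e, x in zip(err, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

t0, t1 = gradient_descent([1, 2, 3], [2, 4, 6])
```

            On this toy set the parameters approach Θ0 = 0, Θ1 = 2, the line that fits the data exactly.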
          • When we need to predict using multiple parameters Θ0, Θ1, ..., Θn, use matrices and vectors, such that:
            • All the training features are arranged in a design matrix X, where each row is a training example and each column is a feature, with a leading column of ones to pair with Θ0
            • All the target values y are arranged in a vector
          • For hΘ(x) = Θ0 + Θ1x, we can use matrix manipulation to make prediction a computationally efficient task: stack each training example as a row [1, x(i)] of a matrix and multiply by the parameter vector [Θ0; Θ1] to obtain all predictions at once
            • Further, we can combine multiple hypotheses into the matrix manipulation: make each column of a parameter matrix one hypothesis, so the product yields one column of predictions per hypothesis
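            • A sketch of the vectorized prediction with NumPy; the feature values and parameters are arbitrary assumptions:

```python
import numpy as np

# Design matrix: one row per training example, first column all ones (for theta0).
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0]])

theta = np.array([50.0, 0.1])   # hypothesis h(x) = 50 + 0.1 * x (hypothetical)
predictions = X @ theta         # one prediction per training example

# Combining multiple hypotheses: one column of Theta per hypothesis.
Theta = np.array([[50.0, 0.0, 100.0],
                  [0.1, 0.2, 0.05]])
all_predictions = X @ Theta     # shape (3 examples, 3 hypotheses)
```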
      • Multivariate linear regression
        • Multiple features
        • x1, x2, ..., xn: input variable, or features
        • xj(i) denotes value of feature j in ith training example
        • Hypothesis is the inner product between the parameter vector theta and the feature vector x: hΘ(x) = θᵀx = Θ0x0 + Θ1x1 + ... + Θnxn
        • The gradient descent algorithm becomes: Θj := Θj - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · xj(i), simultaneously for j = 0..n
        • Recall x0 = 1, therefore x0(i) is also 1
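        • The inner-product hypothesis can be sketched with NumPy; the parameter and feature values are arbitrary assumptions, with x0 = 1:

```python
import numpy as np

theta = np.array([80.0, 0.1, -2.0, 3.0])   # theta0..theta3 (hypothetical values)
x = np.array([1.0, 2104.0, 5.0, 1.0])      # x0 = 1, then features x1..x3

# h_theta(x) = theta^T x = theta0*x0 + theta1*x1 + ... + thetan*xn
prediction = theta @ x
```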
        • Feature scaling is a trick to optimize the gradient descent algorithm so that it takes fewer steps to find the global minimum. This is especially useful when the features have widely differing ranges. In that case, contours with high eccentricity (tall, skinny ellipses, or short, wide ellipses) occur, causing many zigzag movements before converging to the global minimum.
          • Ideally, -1 ≤ xi ≤ 1, or close enough, such as 0 ≤ xi ≤ 3 or -2 ≤ xi ≤ 0.5. A range as narrow as -0.001 ≤ xi ≤ 0.001 is also considered poorly scaled.
          • Further, mean normalization is defined as xi := (xi - μi) / si, where μi is the mean of feature i over the training set and si is either the range (max - min) or the standard deviation
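          • A sketch of mean normalization with the range as the scaling factor (the toy values are an assumption):

```python
def mean_normalize(values):
    """x := (x - mu) / s, where s is taken as the range max - min."""
    mu = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - mu) / s for v in values]

print(mean_normalize([1.0, 2.0, 3.0]))  # → [-0.5, 0.0, 0.5]
```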
        • Choosing a good initial learning rate α may significantly reduce the number of iterations needed to converge to the global optimum. A good candidate α yields an exponential decay when plotting J(Θ) (y-axis) as a function of the number of iterations (x-axis). Conversely, a curve showing exponential growth or a wave form (sinusoidal shape) indicates a poor learning rate α, because J(Θ) is diverging from the global optimum.
          • A strategy is to try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, and 1, i.e., step-wise by a factor of roughly 3 between 0.001 and 1.
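          • The sweep can be sketched in plain Python: run a fixed number of iterations per candidate α and compare the resulting cost J; the toy data and iteration count are assumptions:

```python
def cost(t0, t1, xs, ys):
    m = len(xs)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def descend(xs, ys, alpha, iters=50):
    """Run batch gradient descent and report the final cost J."""
    m, t0, t1 = len(xs), 0.0, 0.0
    for _ in range(iters):
        err = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(err) / m
        g1 = sum(e * x for e, x in zip(err, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1   # simultaneous update
    return cost(t0, t1, xs, ys)

xs, ys = [1, 2, 3], [2, 4, 6]
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]:
    print(alpha, descend(xs, ys, alpha))
```

          Costs that shrink as α grows indicate faster convergence; a cost that blows up instead signals divergence.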
      • Polynomial regression
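        • Polynomial regression can be sketched as linear regression on expanded features, e.g., adding x² as an extra column of the design matrix; the synthetic data and use of least squares below are assumptions:

```python
import numpy as np

# Polynomial regression as linear regression on expanded features:
# h(x) = t0 + t1*x + t2*x^2, i.e. design-matrix columns [1, x, x^2].
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2          # synthetic target (assumption)

X = np.column_stack([np.ones_like(x), x, x ** 2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

        With polynomial features, feature scaling matters even more, since x and x² can differ in range by orders of magnitude.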
  • Questions
    • Should we have a go at ML by ourselves? It would be great to have external advisers give us suggestions at various checkpoints to minimize wasted time, e.g., framing a well-posed learning problem, selecting the right approach, and best practices.