Wiki has numerous data examples. A data example typically comprises row captions and sample data in SDTM
Biomedical concepts are a formalism of clinical knowledge
Hypothesis: We can use the Wiki data examples to automate the creation of biomedical concepts
Key phrases: mining algorithms; turn data into knowledge
NLP
CDISC products, specifically TAUG, are authored by SMEs in the biopharma sector. Acronyms are prevalent in the sector. Because of that, particular linguistic patterns are apparent.
Learning health systems
?
Characteristics of ML
Patience: Human vs. machine ability to learn without being explicitly programmed
Tom Mitchell: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. For example:
E: Process scenarios of clinical data collection, aggregation, and analysis
T: Map a scenario to standardized structures
P: Increased accuracy of data mapping
Learning algorithm categories
Supervised learning, where "right answers" are given
Unsupervised learning, for finding structure of data
Reinforcement learning
Recommender system
ML problem categories
Supervised learning
Regression to predict real-valued output
Classification predicts discrete-valued output from features (attributes), e.g., age × tumor size, tumor cell morphology (shape, uniformity, etc.)
Unsupervised learning
Clustering looks for concentrations of unlabeled (non-attributed) data points, hence clusters or segmentation
Cocktail party problem, an example of using the right tool for complex algorithmic tasks, e.g., a one-liner in Octave <https://www.gnu.org/software/octave> for separating two conversations mixed in one soundtrack: [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1), 1).*x)*x');
Linear regression
A form of supervised learning where the "right answers" are provided
Requires training set of data, where
m: number of training examples
x: input variable, or features
y: output variable, or target
(x, y): a single training example
(x(i), y(i)): the ith training example
Training set > Learning algorithm > Hypothesis h
h is simply a function that takes x as input and produces the predicted y as output
The term hypothesis was coined in the early days of AI and stuck, becoming a general convention
Linear regression function with one variable (univariate linear regression)
hΘ(x) = Θ0 + Θ1x
h maps from x to y
Θi's denote parameters
In other words, it is another representation of the linear equation from algebra, y = mx + b, where m is the slope of the line and b is the y-intercept
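As a minimal sketch (the function name h and the example values Θ0 = 1, Θ1 = 2 are made up for illustration, not learned from data), the hypothesis is just a function of x:

```python
# Hypothesis for univariate linear regression: h(x) = theta0 + theta1 * x.
# theta0 and theta1 are normally learned from data; here they are hand-picked.
def h(x, theta0, theta1):
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2, h behaves like y = 2x + 1:
print(h(3, 1, 2))  # 7
```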
Strategy for picking the coefficients Θ0 and Θ1
Meaning, they are chosen so that hΘ(x) is close to y for the training examples (x, y)
To do that, use cost function J(Θ0, Θ1). Explanation from Beginner: Cost Function and Gradient Descent: the cost function measures how far the predicted line is from the actual points that were already given. In other words, given some data points, pick values for Θ0 and Θ1 and draw the resulting line on the graph; the line will not exactly touch all of the given points, so calculate how far away the original points are from the predicted line. That distance is what the cost function computes. The formula is as follows:
J(Θ0, Θ1) = (1/2m) Σ(i=1..m) (hΘ(x(i)) - y(i))²
Breaking the formula down:
0 means no cost, i.e., a hypothesis that predicts 100% correctly
Therefore, the objective is to minimize the cost, hence hΘ(x) - y should be as small as possible. In other words, the cost measures the difference between the prediction from the hypothesis hΘ(x) and the actual y. Reminder: hΘ(x) refers to Θ0 + Θ1x
The 2nd power is referred to as the "squared error"
Other explanation: "Plot the function y = x^2. It's a parabola that opens upwards. That means there is a definite minimum value. For linear regression (and correlation), we have the same thing, except the formula is slightly more elaborate, but it basically boils down to a parabola opening upwards, and that minimum point is the estimate for the slope and y-intercept that we derive."
The summation i=1..m is to account for all training examples given in a set, hence the formula sums (hΘ(x(i)) - y(i))² over i = 1..m and divides by 2m
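The cost function can be sketched in a few lines of Python; the data set (a perfect y = 2x fit) and the function name cost are illustrative only:

```python
# Squared-error cost: J(theta0, theta1) = (1/2m) * sum_i (h(x_i) - y_i)^2
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# A hypothesis that predicts 100% correctly yields zero cost:
xs, ys = [1, 2, 3], [2, 4, 6]      # y = 2x exactly
print(cost(0, 2, xs, ys))          # 0.0
```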
To approximate the slope of the regression line using the (x, y) training set, divide the covariance of x and y by the variance of x: slope = cov(x, y) / var(x)
A negative covariance means variable X will increase as Y decreases, and vice versa, while a positive covariance means that X and Y will increase or decrease together.
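The covariance/variance relationship gives a closed-form estimate of the line; a sketch in plain Python, with made-up data (the name fit_line is illustrative):

```python
def fit_line(xs, ys):
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return my - slope * mx, slope   # (theta0, theta1)

print(fit_line([1, 2, 3], [3, 5, 7]))  # (1.0, 2.0): y = 2x + 1
```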
The center of a contour plot represents an optimal cost of hΘ(x), i.e., the most minimized cost.
Gradient descent for some cost function J(Θ0, Θ1) to find its minimum cost. It is an iterative algorithm.
Start with some Θ0, Θ1, often zeros
Keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) until a minimum is reached
With the partial derivatives applied, the updates become:
Θ0 := Θ0 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i))
Θ1 := Θ1 - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · x(i)
α denotes learning rate
It is important to simultaneously update Θ0, Θ1 when implementing gradient descent, such that both new values are computed from the current values before either is overwritten:
temp0 := Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
temp1 := Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ0 := temp0
Θ1 := temp1
Recall Sal Khan mentioned the error curve is a bowl shape (a.k.a. a convex function); the gradient descent algorithm pushes the parameters toward the minimum point, regardless of which side of the curve the initial Θ0, Θ1 values fall on.
At the global optimum, the derivative (slope) will be 0
The term "batch gradient descent" refers to when the algorithm uses the entire training set (x, y) at every step. There are other versions of gradient descent that use only a subset.
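Batch gradient descent for the univariate case can be sketched as follows; the hyperparameters (alpha = 0.1, 1000 iterations) and the toy data set are arbitrary choices for illustration, not recommendations:

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    # Batch gradient descent: every step uses the full training set,
    # and theta0/theta1 are updated simultaneously.
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

# Data generated from y = 2x + 1, so the parameters should converge there:
t0, t1 = gradient_descent([1, 2, 3], [3, 5, 7])
print(round(t0, 3), round(t1, 3))  # 1.0 2.0
```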
When we need to predict using multiple parameters Θ0, Θ1, ..., Θn, use matrices and vectors, such that:
All the inputs are arranged in a matrix X, where each row is a training example and each column corresponds to one parameter
All the predictions y are arranged in a vector
For hΘ(x) = Θ0 + Θ1x, we can use matrix manipulation to make prediction a computationally efficient task: arrange each example as a row [1, x] of X, arrange the parameters as a column vector [Θ0; Θ1], and compute all predictions at once as X · [Θ0; Θ1]
Further, we can combine multiple hypotheses into the matrix manipulation:
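A rough illustration of combining multiple hypotheses into one matrix product, using a plain-Python matrix multiply as a stand-in for an optimized library routine; the two hypotheses and the data are made up:

```python
def matmul(A, B):
    # Plain-Python matrix product (row of A dotted with each column of B).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Design matrix: one row per example, with a leading 1 for the intercept.
X = [[1, 1],
     [1, 2],
     [1, 3]]
# Two competing hypotheses, one per column: h_a = 0 + 2x, h_b = 1 + 1x.
Theta = [[0, 1],
         [2, 1]]
print(matmul(X, Theta))  # [[2, 2], [4, 3], [6, 4]]
```

Each column of the result holds one hypothesis's predictions for all three examples.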
Multivariate linear regression
Multiple features
x1, x2, ..., xn: input variable, or features
xj(i) denotes value of feature j in ith training example
Hypothesis is the inner product between the parameter vector theta and feature vector x: hΘ(x) = Θᵀx = Θ0x0 + Θ1x1 + ... + Θnxn
The gradient descent algorithm becomes:
Θj := Θj - α (1/m) Σ(i=1..m) (hΘ(x(i)) - y(i)) · xj(i), simultaneously updated for j = 0, ..., n
Recall x0 = 1, therefore x0(i) is also 1
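A sketch of the multivariate update in plain Python, assuming each row of X already carries the leading x0 = 1; the function name gd_multi and the toy data are illustrative:

```python
def gd_multi(X, y, alpha=0.1, iters=2000):
    # Multivariate batch gradient descent; X rows already include x0 = 1.
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        # Errors are computed once per pass, so all theta_j update simultaneously.
        errs = [sum(t * xj for t, xj in zip(theta, row)) - yi
                for row, yi in zip(X, y)]
        theta = [t - alpha * sum(e * row[j] for e, row in zip(errs, X)) / m
                 for j, t in enumerate(theta)]
    return theta

X = [[1, 0], [1, 1], [1, 2]]
y = [1, 3, 5]                                  # y = 1 + 2*x1
print([round(t, 3) for t in gd_multi(X, y)])   # [1.0, 2.0]
```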
Feature scaling is a trick to optimize the gradient descent algorithm so that it takes fewer steps to find the global minimum. This is especially useful when the features have widely different ranges. In that case, contours with high eccentricity (tall, skinny ellipses; or short, wide ellipses) occur, causing many zigzag movements before converging to the global minimum.
Ideally, -1 ≤ xi ≤ 1, or close enough to -1 or 1, such as 0 ≤ xi ≤ 3 or -2 ≤ xi ≤ 0.5. A very small range such as -0.001 ≤ xi ≤ 0.001 is likewise considered poorly scaled.
Further, mean normalization is defined as xi := (xi - μi) / si, where μi is the mean of feature i over the training set and si is either the range (max - min) or the standard deviation
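Mean normalization can be sketched as follows, taking s as the range (max - min); the standard deviation is a common alternative, and the data values are made up:

```python
def mean_normalize(xs):
    # x_i := (x_i - mu) / s, with s = range (max - min) here.
    mu = sum(xs) / len(xs)
    s = max(xs) - min(xs)
    return [(x - mu) / s for x in xs]

# A wide-ranging feature is rescaled to roughly [-0.5, 0.5]:
print(mean_normalize([100, 200, 300]))  # [-0.5, 0.0, 0.5]
```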
Choosing a good initial learning rate α may significantly reduce the # of iterations needed to converge to the global optimum. A good candidate learning rate α would yield an exponential decay when plotting J(Θ) (y-axis) as a function of the # of iterations (x-axis). Conversely, a curve showing exponential growth or a wave form (sinusoidal function) would indicate a non-optimal learning rate α because it is diverging from the global optimum.
A strategy is to try 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, stepping up by a factor of roughly 3 between 0.001 and 1.
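The sweep can be sketched by running a fixed number of gradient-descent steps per candidate rate and comparing the resulting costs (toy data; the helper name run_gd is made up). A rate that is too small barely reduces the cost, while a rate that is too large makes it blow up:

```python
def run_gd(xs, ys, alpha, iters=50):
    # A few steps of univariate gradient descent; returns the final cost.
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        t0 -= alpha * sum(errs) / m
        t1 -= alpha * sum(e * x for e, x in zip(errs, xs)) / m
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]:
    print(alpha, run_gd(xs, ys, alpha))
```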
Polynomial regression
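Polynomial regression can be treated as linear regression over expanded features; a minimal sketch with hand-picked (not learned) parameters, where poly_features and predict are illustrative names:

```python
# Treat x and x^2 as two features, so h(x) = theta0 + theta1*x + theta2*x^2
# is still linear in the parameters theta.
def poly_features(x):
    return [1.0, x, x * x]

def predict(theta, x):
    return sum(t * f for t, f in zip(theta, poly_features(x)))

# With theta = [1, 0, 1], h(x) = 1 + x^2:
print(predict([1, 0, 1], 3))  # 10.0
```

Because x and x² have very different ranges, feature scaling matters even more here.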
Questions
Should we have a go at ML by ourselves? It would be great to have external advisers give us suggestions at various checkpoints to minimize wasted time, such as framing a well-posed learning problem, selecting the right approach, following best practices, etc.