Distance Correlation Measures Applied to
Analyze Relation between Variables in Liver
Cirrhosis Marker Data

Santam Chakraborty; Atanu Bhattacharjee

Santam Chakraborty¹ and Atanu Bhattacharjee²*

¹Department of Radiation Oncology, Tata Memorial Hospital, Mumbai, India
²Division of Clinical Research and Biostatistics, Malabar Cancer Centre, Thalassery, Kerala-670103, India

*Corresponding Author: Atanu Bhattacharjee, Division of Clinical Research and Biostatistics, Malabar Cancer Centre, Thalassery, Kerala-670103, India. E-mail: atanustat@gmail.com

Received: November 13, 2015; Accepted: November 24, 2015; Published: November 30, 2015

Visit for more related articles at International Journal of Collaborative Research on Internal Medicine & Public Health

Abstract

Distance correlation (DC) is a relatively new measure of the relation between variables. However, the Bayesian counterpart of distance correlation is not well established. In this paper, a Bayesian counterpart of distance correlation is proposed. The proposed method is illustrated with liver cirrhosis marker data. Previously published data on the relation between aspartate transaminase (AST) and alanine transaminase (ALT) are used to formulate the prior information for the Bayesian computation. The DC between AST and ALT (both of which are markers of liver function) computed using the proposed method is 0.44, with a credible interval of 0.41 to 0.46. The Bayesian counterpart proposed herein for computing the DC coefficient is simple and handy.

Introduction

The statistical dependence between two random vectors (irrespective of their dimensions) can be measured by distance correlation (DC) [1,2]. DC ranges between 0 and 1, with 0 indicating that the vectors are statistically independent. As a generalized form of Pearson correlation, it provides a method to measure multivariate independence. Székely et al. have shown that it is consistent against all dependent alternatives with finite second moments [3]. The bias of DC across different dimensions has also been examined [3]. The unbiased t-test is considered suitable for testing the independence of variables using distance correlation.

The use of DC has been extended to high-dimensional data [4]. Its application to functional data has also been extended recently through Hilbert spaces [5]. Several other tools are available to the scientific community for more complex problems: canonical correlation (which considers the linear combinations of the variables that have maximum correlation with each other), rank correlation, and Rényi correlation (which considers the cosine of the angle between the linear subspaces of mean-zero, square-integrable, real-valued random variables generated by each random variable) [4]. However, all of them have advantages and limitations [6]. The joint independence of random variables can be explored through DC [2]. It is a matrix-inversion-free approach: the dependence between two random variables can be measured and tested without inverting a matrix [7]. Experimental and observational studies in clinical medicine usually rely on exploring the relation between two variables of interest (for example, understanding how high blood pressure and increased total serum cholesterol are related to each other in order to predict the risk of myocardial infarction). The objective of the present study is to demonstrate the use of a Bayesian approach to DC and to formulate a methodology for its calculation. The method is then illustrated with a clinical trial example.

Distance Covariance and Distance Correlation

The distance covariance between the random variables X and Y, with marginal characteristic functions f_X (t) and f_Y (s), is defined by

(1)

The function f_X,Y is the joint characteristic function of X and Y. The terms t and s are vectors, and their inner product is < t, s >. The distance covariance measures the distance ||f_X,Y (t, s) − f_X (t) f_Y (s)|| between the joint characteristic function and the product of the marginal characteristic functions. The random vectors X and Y are in R^p and R^q respectively. The hypotheses are H0: f_X,Y = f_X f_Y and H1: f_X,Y ≠ f_X f_Y. The distance variance is

(2)

The DC between X and Y with finite first moments, R(X, Y), is defined by

(3)
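The displayed equations (1) to (3) are not reproduced in this copy. As a reading aid, the standard definitions from Székely et al. [2], which the surrounding text appears to follow, can be written as below. The weight constants c_p and c_q (with c_d = π^{(1+d)/2} / Γ((1+d)/2)) and the norm subscripts are taken from that reference and are an assumption about what the original equations showed.

```latex
\begin{align}
\mathcal{V}^2(X,Y) &= \lVert f_{X,Y}(t,s) - f_X(t)\, f_Y(s) \rVert^2
  = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}}
    \frac{\lvert f_{X,Y}(t,s) - f_X(t)\, f_Y(s) \rvert^2}
         {\lvert t \rvert_p^{1+p}\, \lvert s \rvert_q^{1+q}} \, dt \, ds, \\
\mathcal{V}^2(X) &= \mathcal{V}^2(X,X), \\
\mathcal{R}^2(X,Y) &=
  \begin{cases}
    \dfrac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X)\, \mathcal{V}^2(Y)}}, & \mathcal{V}^2(X)\, \mathcal{V}^2(Y) > 0, \\[1ex]
    0, & \mathcal{V}^2(X)\, \mathcal{V}^2(Y) = 0.
  \end{cases}
\end{align}
```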

The sample distance covariance V_n (X, Y) is defined by

(4)

Similarly it can be defined as:

(5)

The parameters are

,and

(6)

B_kl is defined similarly.
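The sample quantities described above can be sketched in code. The following is a minimal pure-Python illustration, assuming the standard double-centering construction from Székely et al. [2]: a_kl is the Euclidean distance between observations k and l, and A_kl is a_kl minus its row mean and column mean plus the grand mean. The function names are hypothetical and the toy data are invented purely for illustration; this is not the authors' implementation.

```python
import math


def _as_points(sample):
    """Represent each observation as a tuple so scalars and vectors share one code path."""
    return [tuple(v) if hasattr(v, "__iter__") else (float(v),) for v in sample]


def _centered_distances(sample):
    """Double-centered matrix A_kl = a_kl - row mean - column mean + grand mean."""
    pts = _as_points(sample)
    n = len(pts)
    a = [[math.dist(pts[k], pts[l]) for l in range(n)] for k in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[a[k][l] - row_mean[k] - row_mean[l] + grand_mean for l in range(n)]
            for k in range(n)]


def _mean_product(A, B):
    """(1 / n^2) * sum over k, l of A_kl * B_kl."""
    n = len(A)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation R_n(x, y), biased (V-statistic) version."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = _mean_product(A, B)  # squared sample distance covariance
    denom = math.sqrt(_mean_product(A, A) * _mean_product(B, B))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson_correlation(x, y):
    """Ordinary product-moment correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (invented): y depends on x, but not linearly.
x = [i / 10 for i in range(-10, 11)]
y = [v * v for v in x]
pearson_xy = pearson_correlation(x, y)
dcor_xy = distance_correlation(x, y)
dcor_linear = distance_correlation(x, [2 * v + 1 for v in x])
```

On this toy sample the Pearson correlation of x and x² vanishes by symmetry while the distance correlation stays positive, and an exact linear relation gives a distance correlation of 1.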

Properties

DC provides scope to generalize the correlation between variables X and Y through R. It is defined for arbitrary dimensions, and R = 0 when X and Y are independent. The range of DC is 0 ≤ R ≤ 1. For bivariate normal variables, R can be expressed as a function of the Pearson correlation coefficient ρ, with R(X,Y) ≤ |ρ(X,Y)| and equality when ρ = ±1. Suppose the observed variables are expressed as A_i = X_i + ε_i and B_j = Y_j + ε_j respectively, where the error terms ε_i and ε_j are independent of the variables X_i and Y_j. The relation between the observed quantities A_i and B_j is not itself of interest; it is the relation between X_i and Y_j that matters. The strength of the relation between X and Y can be measured through DC in this scenario.

In One-sided Test

The frequentist approach tests the problem through the p-value p(X) of the null hypothesis H0. In contrast, the Bayesian approach uses the posterior probability p(H0 | X). Let the data follow a normal distribution N(θ, σ²) with null hypothesis H0: θ ≤ 0 and alternative H1: θ > 0. The frequentist and robust Bayesian measures often coincide [8]. Suppose the marginal DC ρ is computed between p(X) = 1 − Φ(X/σ) and p(H0 | X). This DC should be greater than or equal to zero, because p(X) and p(H0 | X) are both decreasing in X.
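A small numerical check may make the coincidence concrete. The sketch below assumes a single observation x from N(θ, σ²) and a flat (improper uniform) prior on θ, so that θ | x is N(x, σ²); the flat prior is an assumption chosen for illustration, not a choice stated in the text, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p_value(x, sigma):
    """p(x) = 1 - Phi(x / sigma) for H0: theta <= 0 against H1: theta > 0."""
    return 1.0 - normal_cdf(x / sigma)


def posterior_prob_h0_flat_prior(x, sigma):
    """P(theta <= 0 | x) when theta | x ~ N(x, sigma^2), i.e. under a flat prior."""
    return normal_cdf((0.0 - x) / sigma)


sigma = 1.0
observations = [-1.0, 0.0, 0.5, 1.0, 2.0]  # illustrative values only
p_values = [one_sided_p_value(x, sigma) for x in observations]
posteriors = [posterior_prob_h0_flat_prior(x, sigma) for x in observations]
```

Under these assumptions the two lists agree up to rounding, and both decrease as x increases, which is the monotonicity used in the argument above.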

Parameter and Unbiased Estimator

Suppose (θ, X) are random variables with joint characteristic function f_θ,X (t, s), and the marginal distribution of θ is π. Let δ(X) be an estimator of θ. Under squared error loss, the Bayes risk is r(π, δ) = E[δ(X) − θ]² and the Bayes estimator is δ_π(X) = E(θ | X). The DC between θ and δ(X) is

(7)

Method

Bayes' theorem combines prior information about the relevant parameter with the observed data for a specific statistical analysis. It is helpful for testing a hypothesis through the posterior probability of the parameter of interest. The posterior probability of the parameter of interest R(X, Y) can be computed through Bayes' theorem as

(8)

The term P(R(X,Y)) is the prior probability of R(X,Y) observed from the previous study. The term P(information | R(X,Y)) is the likelihood of R(X,Y) in the previous study or in the data collected by the investigator. The posterior probabilities must sum to 1, as required by the law of total probability. The relation between posterior and prior is

Posterior Probability ∝ Likelihood × Prior Probability (9)

The posterior density of R(X,Y) is generated with

(10)

Let the means and variances of X and Y be μ1, μ2, σ²₁ and σ²₂ respectively. The mean (z) is derived from

(11)

The term R(X, Y) is defined by tanh ε and it is assumed ε

. The mathematical formulations are detailed in Fisher (1915). The hyperbolic transformation makes it possible to use a conjugate prior with normal distributions.

The posterior mean can be represented with

(12)

(13)
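The conjugate update on the transformed scale can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of equations (10) to (13), which are missing from this copy: it assumes Fisher's approximation that z = atanh(r) is normal with variance 1/(n − 3), treats the earlier study as a normal prior on z, and combines prior and data by precision weighting. Applying the 1/(n − 3) variance to a distance correlation rather than a Pearson correlation is itself an assumption. The function name and the numerical inputs are hypothetical.

```python
import math


def fisher_z_posterior(r_prior, n_prior, r_data, n_data, z_crit=1.959963984540054):
    """Posterior mean and approximate 95% interval for a correlation-type
    coefficient, using a normal-normal conjugate update on the atanh scale."""
    z_prior = math.atanh(r_prior)
    z_data = math.atanh(r_data)
    # Precision = 1 / variance, with var(z) approximated by 1 / (n - 3).
    prec_prior = n_prior - 3
    prec_data = n_data - 3
    prec_post = prec_prior + prec_data
    z_post = (prec_prior * z_prior + prec_data * z_data) / prec_post
    half_width = z_crit / math.sqrt(prec_post)
    # Back-transform the mean and interval limits with tanh.
    return (math.tanh(z_post),
            math.tanh(z_post - half_width),
            math.tanh(z_post + half_width))


# Hypothetical inputs, chosen only to exercise the function.
post_mean, post_lower, post_upper = fisher_z_posterior(
    r_prior=0.40, n_prior=1000, r_data=0.50, n_data=500)
```

The posterior mean falls between the prior value and the observed value, pulled toward whichever source has the larger sample size.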

The prior has the form

(14)

The prior depends on the choice of c. Setting c = 0 gives P(R(X,Y)) ∝ 1. The specification of the prior is important for testing the parameters under hypotheses H0 and H1, and it is a main focus of research in the Bayesian approach. Here the prior specification is carried out through regression modeling. Let the response of interest (Y), covariate (X), error (ε) and intercept (α) be related through the regression line

(15)

Zellner (1986) introduced the g prior for the above-mentioned β coefficient. It is an extension of Jeffreys' prior on the error precision ϕ, with a uniform prior on the intercept α, given by

(16)

Information about β is obtained through ϕ^−1 (X^T X)^−1. The value of g specifies the weight of the prior relative to the observed data: g = 1 gives the prior the same weight as the observed data, whereas g = 5 gives the prior 1/5 of the weight of the observed data. The selection of the value of g is very important [9]. One choice is g = n, where n is the sample size. Another choice that has been discussed is g = k, where k is the number of parameters. There is a substantial literature on the selection of the g prior. The present work uses the Jeffreys-Zellner-Siow (JZS) prior for the g-value. It was presented by Liang and colleagues [10] and applied to the correlation coefficient [11]. The prior has the form

(17)

(18)

M₁ :Y =α +β X +ε (19)

The above-mentioned formula is also useful for calculating the Bayes factor. The prior is applied as the default prior for the t-test [7]. The Bayes factor is applied through the JZS prior for DC in the regression line, with the JZS prior placed on the regression coefficient β. Our goal is to compute the DC, intercept (α), regression coefficient (β) and error term (ε) as detailed in the regression equation (15). Let the regression equation be further separated into Model (M1) and Model (M0) by

M₁ :Y =α +β X +ε (20)

M₀ :Y =α +ε (21)

Model (M₁) states the presence of DC and Model (M₀) states its absence.

Now, the Bayes Factor through JZS is defined [10] as,

(22)

(23)

If the value of BF10 is greater than 1, it indicates the presence of DC; otherwise it does not.
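The equation images for (22) and (23) are not present in this copy. As a hedged point of reference, the Bayes factor under the Zellner-Siow mixture of g priors in Liang et al. [10], specialised to a single predictor, takes the form below, where n is the sample size and r is the estimated correlation. Whether this matches the authors' equations term by term is an assumption.

```latex
BF_{10} = \frac{(n/2)^{1/2}}{\Gamma(1/2)}
  \int_0^{\infty} (1+g)^{(n-2)/2}
  \left[ 1 + (1 - r^2)\, g \right]^{-(n-1)/2}
  g^{-3/2}\, e^{-n/(2g)} \, dg
```

The first two factors in the integrand are the Bayes factor for a fixed g, and the remaining factors are the inverse-gamma density placed on g.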

Testing

Under the null hypothesis H0, the model (M0) is assumed, and under the alternative hypothesis H1, the model (M1). The prior probability of the null is denoted p(M0) and that of the alternative p(M1). Bayes' theorem is then applied to the observed data to compute the posterior probability of each hypothesis. The posterior probability of the alternative hypothesis is computed as

(24)

The term P(Y | M1) is the marginal likelihood of the data under the alternative hypothesis. The marginal likelihood is calculated as

(25)

The Bayes factor [12] is useful for comparing P(M1 | Y) with P(M0 | Y):

(26)
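The link between the Bayes factor and the posterior model probability can be illustrated with a short calculation. The sketch assumes the usual identity P(M1 | Y) = BF10 p(M1) / (BF10 p(M1) + p(M0)) with p(M0) = 1 − p(M1); the equal prior probabilities in the example are an assumption made for illustration, and the function name is hypothetical. The value 8.3 is the BF10 reported in the illustrated example of this paper.

```python
def posterior_prob_m1(bf10, prior_m1=0.5):
    """Posterior probability of M1 given BF10 and the prior probability of M1."""
    if not 0.0 < prior_m1 < 1.0:
        raise ValueError("prior_m1 must lie strictly between 0 and 1")
    prior_m0 = 1.0 - prior_m1
    return bf10 * prior_m1 / (bf10 * prior_m1 + prior_m0)


# BF10 as reported in the illustrated example; equal priors are assumed here.
p_m1 = posterior_prob_m1(8.3, prior_m1=0.5)
```

With equal prior probabilities the result reduces to BF10 / (1 + BF10), so a Bayes factor of 1 leaves the two models equally probable.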

Illustrated Example

Aminotransferases are serum enzymes that are used to detect malfunction of the liver, heart, lung, skeletal muscles and brain [13]. Among the aminotransferases, alanine and aspartate aminotransferase (ALT and AST respectively) are routinely measured to assess liver function [14]. Kumar et al. have recently published the normal range of serum AST and ALT in over 5000 Indian blood donors and have proposed normal limits for the healthy population [15]. In this example we illustrate the use of DC between AST and ALT measurements in the same individuals. The relation between AST and ALT reported for a sample of 4917 individuals is used as prior information [15]. The raw data on AST and ALT of 606 individuals are detailed in [16]. In both of the above-mentioned studies, the relation between serum alanine aminotransferase (ALT) and serum aspartate aminotransferase (AST) was observed. The relation between the variables is explored through distance covariance with a Bayesian approach. The relation between ALT and AST is first taken from [15]. The measured distance correlation is observed with error. The Bayesian posterior estimate of the robust DC between ALT and AST is computed by

(27)

(28)

(29)

(30)

i.e. (0.41, 0.46). The posterior estimate of DC, i.e. R(X,Y), is 0.44 with credible interval (0.41, 0.46). This simple approach for DC can be extended to other experimental research. The posterior mean of 0.44 and the sample size of 606 are applied to obtain BF10 in equation (23). The calculated BF10 is 8.3. This is evidence in favour of model M1 in comparison to model M0. The presence of DC is tested through the g prior.

Discussion

Recently, testing for the presence of DC has been attempted. The t-test has been found suitable to test the presence of DC, and the relevant factors for performing it have been proposed [3]. The evaluation of the direct relation between two variables is important. Pearson and Spearman correlations are commonly applied tools to explore the relation between variables. The strength of the relation between variables can also be assessed by canonical, rank and Rényi correlation [4]. The most widely used correlation tool, Pearson correlation, fails in multivariate data sets. It becomes zero for an independent bivariate normal distribution, but it fails to characterize multivariate dependence in general. This limitation can be overcome by testing the joint independence of the random variables through DC. DC is a generalized form of product-moment correlation and of bivariate measures of dependence. It is a very useful and largely unexplored area for statistical inference. The idea of this work is to establish the application of new types of correlation tools for the measurement of dependence between variables. DC is more applicable to complicated multivariate data. The detailed application of DC has recently been established [2]. There are several advantages of applying DC over simple correlation. The Bayesian application to DC computation has been elaborated [6]. However, the application of the g prior to DC testing is completely new. There is a general tendency to ignore prior information about the relation between variables. The Bayesian approach gives scope to use prior information about the relation between variables to explore the strength of the current relation between them. The application of the Bayesian approach to compute DC is illustrated, and hypothesis testing through the Bayes factor is detailed, on biochemical markers of liver performance. The work is illustrated with the estimation of DC between AST and ALT and is dedicated to a Bayesian test for DC.

The simple method proposed can be used by researchers exploring the use of DC in their research work. This work is not an attempt to develop a new statistical model; it is an effort to explore the application of the Bayesian approach to compute DC. The application is illustrated with biomarkers of liver cirrhosis observed through clinical trial data analysis. The Bayesian approach can be useful for obtaining clear evidence from test statistics on the relation between variables. The Bayes factor is useful for testing the presence of DC and for gauging the strength of a hypothesis. It can be considered an easily interpretable tool for discovering relations. The illustrated tool can be adopted in future research to explore the relation between variables.

Competing Interest

None Declared

References

1. G. J. Székely, M. L. Rizzo (2009) Brownian distance covariance. The Annals of Applied Statistics 3 (4): 1236-1265.
2. G. J. Székely, M. L. Rizzo, N. K. Bakirov (2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics 35 (6): 2769-2794.
3. G. J. Székely, M. L. Rizzo (2013) The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis 117: 193-213.
4. A. Gretton, K. Fukumizu, B. K. Sriperumbudur (2009) Discussion of: Brownian distance covariance. The Annals of Applied Statistics: 1285-1294.
5. R. Lyons (2013) Distance covariance in metric spaces. The Annals of Probability 41 (5): 3284-3305.
6. A. Bhattacharjee (2014) Distance correlation coefficient: An application with Bayesian approach in clinical data analysis. Journal of Modern Applied Statistical Methods 13 (1): 23.
7. J. R. Blum, J. Kiefer, M. Rosenblatt (1961) Distribution free tests of independence based on the sample distribution function. The Annals of Mathematical Statistics: 485-498.
8. G. Casella, R. L. Berger (2002) Statistical inference, Vol. 2. Duxbury, Pacific Grove, CA.
9. E. George, D. P. Foster (2000) Calibration and empirical Bayes variable selection. Biometrika 87 (4): 731-747.
10. F. Liang, R. Paulo, G. Molina, M. A. Clyde, J. O. Berger. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103 (481).
11. T. W. Anderson (1958) An introduction to multivariate statistical analysis, Vol. 2. Wiley, New York.
12. J. O. Berger, L. R. Pericchi (1996) The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association 91 (433): 109-122.
13. S. Reitman, S. Frankel (1957) A colorimetric method for the determination of serum glutamic oxalacetic and glutamic pyruvic transaminases. Am J Clin Pathol 28: 56-63.
14. F. Wroblewski (1959) The clinical significance of transaminase activities of serum. Am J Med 27: 911-923.
15. S. Kumar, A. Amarapurkar, D. Amarapurkar (2013) Serum aminotransferase levels in healthy population from western India. Indian J Med Res 138: 894-899.
16. H. Southworth, J. E. Heffernan (2012) Extreme value modelling of laboratory safety data from clinical studies. Pharmaceutical Statistics 11 (5): 361-366.