1. The original intention of PhaSePred

Liquid-liquid phase separation (LLPS) is an important mechanism that mediates the compartmentalization of proteins in cells. Multivalent weak interactions between these molecules are the driving force of LLPS. The interactions can generally be classified into two categories: one mediated by intrinsically disordered regions (IDRs) and the other mediated by multiple modular domains or motifs. A difference between the two phase separation mechanisms is that a single species can undergo IDR-mediated phase separation, while phase separation mediated by multiple interacting domains often involves two or more different protein species. Herein, we characterize proteins that can self-assemble to form condensates as self-assembling phase-separating proteins (LLPS-Self), and we define proteins whose phase separation behaviors are regulated by partner components as partner-dependent phase-separating proteins (LLPS-Part).
Currently, there is no computational tool that can identify partner-dependent phase-separating proteins. However, most phase-separating systems involve multiple partners in biological conditions. A similar pattern was displayed in the annotations collected from PhaSepDB with more LLPS-Part than LLPS-Self proteins. Therefore, the screening of potential partner-dependent proteins is of great importance.
Sequence-based analysis tools are commonly used to screen phase-separating proteins. However, current LLPS predictors recognize vastly different kinds of proteins because they were developed to screen various sequence features, calling for the development of a comprehensive predictor.
Herein we build PhaSePred, a meta-predictor that incorporates prediction scores of multiple LLPS-related predictors, including the self-assembling phase-separating predictor SaPS, the partner-dependent phase-separating predictor PdPS, granule-forming propensity predictor catGRANULE, prion-like domain (PLD) predictor PLAAC, π-contact predictor PScore, IDR predictor ESpritz, low-complexity region (LCR) predictor SEG, hydropathy prediction from CIDER, coiled-coil domain predictor DeepCoil, immunofluorescence (IF) image-based droplet-forming propensity predictor DeepPhase.

2. Construction of self-assembling and partner-dependent phase-separating predictors


The process of constructing SaPS and PdPS models can be divided into three steps:
1) Data collection:
Proteins used to train the self-assembling and partner-dependent phase-separating predictors were collected from PhaSepDB 2.0. The annotated proteins in PhaSepDB 2.0 were divided into two groups: PS-self and PS-other. 'PS-self' refers to those proteins that can undergo self-assembling phase separation in vitro. 'PS-other' refers to those proteins contributing to the formation of biomolecular condensates. If a protein participates in an MLO with partner components, its partners will be recorded in the 'Partner' column. We selected proteins labeled 'PS-self' as self-assembling phase-separating proteins, and selected proteins with the protein partner as our defined partner-dependent proteins.
2) Feature calculation:
Features with batch prediction or provide predicted results were used to train the model. We used the PLAAC NLLR score to indicate prion-like propensity, the catGRANULE score to indicate granule-formation propensity, the proportion of disordered and low-complexity residues to reflect IDR and LCR propensity, the DeepCoil to detect potential coiled-coil structures, and used the hydropathy score, the fraction of charged residues (FCR), and PScore to illustrate multivalent interactions including hydrophobicity, electrostatic interactions, and π-π stacking. Sequence-irrelevant features such as IF image and Phosphorylation frequency were used to train the human models.
3) Model construction:
Incorporating the features mentioned above, we used the Python package of XGBoost to construct models for predicting self-assembling phase-separating (SaPS) and Partner-dependent phase-separating (PdPS) proteins, respectively. Models with 8 features (Hydropathy, FCR, IDR, LCR, PScore, PLAAC, catGRANULE, DeepCoil) are built on all-species data, while models with 10 features (the 8 features described above plus Phos frequency and DeepPhase) are only built on the human proteome.

3. Data distribution

Currently, PhaSePred contains UniProt proteins from 18 well-studied species, including Homo sapiens (human), Mus musculus (mouse), Arabidopsis thaliana (mouse-ear cress), Rattus norvegicus (rat), Saccharomyces cerevisiae (baker's yeast), Escherichia coli (strain K12 / DH10B), Bos taurus (bovine), Schizosaccharomyces pombe (strain 972 / ATCC 24843) (fission yeast), Bacillus subtilis (strain 168), Dictyostelium discoideum (slime mold), Caenorhabditis elegans, Oryza sativa subsp. japonica (rice), Drosophila melanogaster (fruit fly), Xenopus laevis (African clawed frog), Danio rerio (zebrafish) (Brachydanio rerio), Gallus gallus (chicken), Sus scrofa (pig), and Canis lupus familiaris (dog) (Canis familiaris).
As of August 2021, integrated predictions for 116,806 sequences were available.


4. Prediction of LLPS-related features

We provided residue-level scores and/or annotations of eleven LLPS-related features:
1) The domain information was annotated by the local package of InterProScan with the following command:
./interproscan.sh -i /inputFilePath/source.fasta -o /outputFilePath/target.tsv -f tsv
We only kept the annotation information from the Pfam database.
2) The granule-forming propensity was predicted by the web server of catGRANULE with default parameters.
3) The prion-like domains (PLDs) were predicted by the local package of PLAAC with the following command:
java -jar ./plaac.jar -i /inputFilePath/source.fasta -a 1 -p all
PLAAC provides three summary scores for a given sequence, including LLR, CORE, and PRD. Since the LLR score is more appreciated in whole-proteome screening, we used the normalized LLR score (NLLR) to represent the PLD-forming propensity.
4) The π-contact propensities were predicted by the local package of PScore with the following command:
./PScore.py /inputFilePath/source.fasta -output /outputFilePath/target.out -mute -overwrite -residue_scores -score_components
5) The intrinsically disordered regions (IDRs) were predicted by the local package of ESpritz with the following command:
./espritz.pl /inputFilePath D 0
We set the prediction type at 'Disprot' and the decision threshold at '5% false-positive rate (FPR)'.
6) The residue hydropathy was predicted by the Python package of localCIDER. We used the function 'get_linear_hydropathy()' with the default parameter, which returns the local hydropathy with the Kyte-Doolite hydropathy scale.
7) The coiled-coil (CC) structures were predicted by the Python package of DeepCoil with GPU acceleration. We used the column 'cc' to represent the coiled-coil propensity, which ranges from 0 to 1.
8) The low-complexity regions (LCRs) were predicted by the local package of SEG with the following command:
./seg /inputFilePath/source.fasta -x
9) The fraction of charged residues (FCR) was counted using the Python package of localCIDER. We used the function 'get_FCR()' with the default parameter, which assumes a neutral pH where only R/K/D/E are charged.
10) The Phosphorylation sites were collected from the PhosphoSitePlus (fetched Sep 8, 2020). This property was only available on the Human proteome.
11) The immunofluorescence (IF) image-based droplet-forming propensity was collected from DeepPhase. This property was only available on the Human proteome.

5. Related links

PhaSepDB The database of phase-separation related proteins
MloDisDB A manually curated DataBase of the relations between MembraneLess Organelles and DISeases
PhaSePro PhaSePro is the comprehensive database of proteins driving liquid-liquid phase separation (LLPS) in living cells
DrLLPS Data resource of liquid-liquid phase separation
LLPSDB A database of proteins undergoing liquid-liquid phase separation in vitro
PLAAC PLAAC searches protein sequences to identify probable prion subsequences using a hidden-Markov model (HMM) algorithm.
PSPer Unsupervised prediction of proteins able to form phase-separated liquid droplets acting as membraneless organelles.
Pscore Pi-Pi contacts are an overlooked protein feature relevant to phase separation.
catGRANULE catGRANULE is an algorithm to predict liquid-liquid phase separation propensity (LLPS).
R+Y Critical concentration prediction based on number of arginine and tyrosine residues, extrapolated from FET family proteins.
MobiDB a database of protein disorder and mobility annotations.
DeepCoil a fast and accurate prediction of coiled-coil domains in protein sequences.
localCIDRE resources to analyze sequence-ensemble relationships of intrinsically disordered proteins.
Espritz Accurate and fast prediction of protein disorder.
SEG Statistics of local complexity in amino acid sequences and sequence databases.
PhosphoSitePlus provides comprehensive information and tools for the study of protein post-translational modifications (PTMs) including phosphorylation, acetylation, and more.
DeepPhase Proteome-scale analysis of phase-separated proteins in immunofluorescence images.
The Human Protein Atlas The Human Protein Atlas is a Swedish-based program initiated in 2003 with the aim to map all the human proteins in cells, tissues and organs using integration of various omics technologies.