7th International Conference on Big Data Analysis and Data Mining

Value Added Abstracts

Pages: 1 - 1

Past Conference Report on Data Mining 2020

Fionn Murtagh

Conference Series LLC Ltd hosted the “Data Mining 2020” webinar on July 17-18, 2020, with the theme “Knowledge discovery in databases: Step towards recovering economy after the pandemic: Covid-19”. The event was a great success. Eminent keynote speakers from reputed institutions and organizations addressed the gathering.

We extend our grateful thanks to all the speakers and conference attendees who contributed to the successful run of the conference.

Data Mining 2020 brought together distinguished speakers who shared their knowledge and discussed the latest innovations in all areas of Big Data Analysis and Data Mining.

The Data Mining Organizing Committee extends its gratitude to, and congratulates, the Honorable Moderators of the conference.

Conference Series LLC Ltd extends its warm gratitude to all the Honorable Guests and Keynote Speakers of “Data Mining 2020”.

Fionn Murtagh, University of Huddersfield, UK

Conference Series LLC Ltd is privileged to felicitate the Data Mining 2020 Organizing Committee, Keynote Speakers, Chairs and Co-Chairs, and the Moderators of the conference, whose support and efforts made the conference a success. Conference Series LLC Ltd thanks every individual participant for the enthusiastic response. This inspires us to continue organizing events and conferences for further research in the field of Big Data Analysis and Data Mining.

Conference Series LLC Ltd is glad to announce its “Webinar on 8th International Conference on Big Data Analysis and Data Mining”. We cordially welcome all eminent researchers, data scientists, data engineers, faculty, data mining experts, students and delegates to take part in this upcoming conference, to witness invaluable scientific discussions and to contribute to future innovations in the field of Data Mining, with a 20% discount on the Early Bird prices.

Bookmark your dates for “Webinar on Data Mining 2021”. Nominations for Best Poster Awards and Young Researcher Awards are open worldwide.

Pages: 2 - 2

E-BABE- A comprehensive framework of gene prioritization for flooding tolerance in soybean

Chung Feng Kao

Soybean [Glycine max (L.) Merr] is rich in protein and oil and is one of the most important crops around the world. Drastic and extreme changes in global climate have led to decreasing crop production, deterioration of quality, and increasing plant diseases and insect pests, resulting in economic losses. Facing such harsh circumstances, a seed that is less susceptible to stresses, both abiotic and biotic, is urgently needed. The present study proposes a comprehensive framework, including phenotype-genotype data mining, integration analysis, gene prioritization and systems biology, to construct prioritized genes of flooding tolerance (FTgenes) in soybean and to develop a fast-precision breeding platform for variety selection of important traits in soybean. We applied big data analytic strategies to mine flooding tolerance related data in soybean, both phenomic and genomic, from cloud-based text mining across different data sources in the NCBI. We conducted meta-analysis and gene mapping to integrate the large amount of information collected from multi-dimensional data sources. We developed a prioritization algorithm to precisely prioritize a collection of candidate genes for flooding tolerance. As a result, 219 FTgenes were selected, based on the optimal cutoff point of the combined score, from 35,970 prioritized genes of soybean. We found that the FTgenes were significantly enriched in response to wounding, chitin, water deprivation, abscisic acid, ethylene and jasmonic acid biosynthetic process pathways, which play an important role in the biosynthesis of plant hormones in soybean. Our results provide valuable information for further studies in breeding commercial varieties.

Pages: 3 - 3

Investigating the association between the flooding tolerance genes of soybean by pathway analysis and network analysis

Li-Hsin Jhan, Mu-Chien Lai, Chung-Feng Kao

Under extreme climate conditions, events of crop damage are increasing, and there is an urgent need to breed stress-tolerant varieties. Flooding stress at different growth stages of soybean can negatively affect seed germination, plant growth, flowering, yield and quality. These impacts are linked to the plant's ability to adapt to or tolerate flooding stress, which involves complex physiological traits, metabolic pathways, biological processes, molecular components and morphological adaptations. However, investigating the mechanisms of flooding stress tolerance is time-consuming. In the present study, we applied systems biology approaches to identify pathways and network hubs linked to flooding stress tolerance. We previously identified 63 prioritized flooding tolerance genes (FTgenes) of soybean from multi-dimensional data sources using large-scale data mining and gene prioritization methods. We conducted competitive (using the hypergeometric test) and self-contained (using SUMSTAT) approaches to gene-set enrichment analysis, using the gene ontology (GO) database, and found 20 significantly enriched pathways by the hypergeometric test and 20 significantly enriched pathways by SUMSTAT. These GO pathways were further compared to seven candidate pathways identified from gene regulatory pathway databases collected from NCBI PubMed. The FTgenes were found to be involved in resisting flooding stress in these significantly enriched pathways, which form a module through a closely linked pathway crosstalk network. The module was associated with ethylene biosynthesis, jasmonic acid biosynthesis, abscisic acid biosynthesis, and the phosphorylation pathway. The systems biology methods may provide novel insight into the FTgenes and flooding stress tolerance.
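The competitive enrichment step mentioned above can be illustrated with a small sketch of a hypergeometric over-representation test. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `hypergeom_enrichment_p` and all counts in the usage lines except the 63 FTgenes are hypothetical placeholders.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, overlap: int) -> float:
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    population: number of background genes (hypothetical in the example)
    annotated:  background genes annotated to the pathway
    sample:     size of the gene set being tested (e.g. the FTgenes)
    overlap:    genes in the set that are annotated to the pathway
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts for one GO term; only the set size of 63 comes from the text.
p_value = hypergeom_enrichment_p(population=1000, annotated=50, sample=63, overlap=10)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice such p-values would be computed for every GO term and then corrected for multiple testing; the abstract does not say which correction was used.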

Pages: 4 - 5

Analytical Focus and Contextuality, Exploiting Resolution Scale, Addressing Bias

Fionn Murtagh

Examples are provided of the following. The Correspondence Analysis platform, also termed Geometric Data Analysis, exploits conceptual resolution scale and has both analytical focus and contextualization; it semantically maps qualitative and quantitative data. Big Data analytics has new challenges and opportunities; key factors are security through aggregation and ethical accuracy of individual mapping, and, process-wise, this is carried out as multi-resolution analysis. For the analytical topology of the data, from hierarchical clustering, the following is developed, with properties noted here, and essentially with linear time computational complexity. For text mining, and also for medical and health analytics, the analysis determines a divisive, ternary (i.e. p-adic where p = 3) hierarchical clustering from factor space mapping. Hence the topology (i.e. ultrametric topology, here using a ternary hierarchical clustering) is related to the geometry of the data (i.e. the Euclidean metric endowed factor space, the semantic mapping of the data, from Correspondence Analysis). What is determined is the differentiation, in Data Mining, of what is both exceptional and quite unique relative to what is both common and shared, and predominant. A major analytical theme, started now, is Mental Health, with analytical focus and contextualization, with the objective of interpreting mental capital. Another analytical theme is to be developing economies.

Pages: 7 - 7

Optimization inside the General Efficiency

Vasile Postolic, University of Bacău

This research work is concerned with the study of the General Efficiency and Optimization. We present the Efficiency and Optimization in their most natural context, offered by the Infinite Dimensional Ordered Vector Spaces, following our recent results on these subjects. Implications and Applications in Vector Optimization through the agency of Isac’s Cones, and the new link between the General Efficiency and the Strong Optimization by the Full Nuclear Cones, are presented. An important extension of our Coincidence Result between the Efficient Points Sets and the Choquet Boundaries is developed. In this way, the Efficiency is connected with Potential Theory by Optimization and conversely. Several pertinent references conclude this investigation.

Pages: 10 - 11

E-BABE- Data mining: seasonal and temperature fluctuations in thyroid-stimulating hormone

Danchen Wang

Background: Thyroid-stimulating hormone (TSH) plays a key role in maintaining normal thyroid function. Here, we used “big data” to analyze the effects of seasonality and temperature on TSH concentrations to understand factors affecting the reference interval.

Methods: Information from 339,985 patients at Peking Union Medical College Hospital was collected from September 1st, 2013, to August 31st, 2016, and retrospectively analyzed. A statistical method was used to exclude outliers, with data from 206,486 patients included in the final analysis. The research period was divided into four seasons according to the National Weather Service. Correlations between TSH concentrations and season and temperature were determined.

Results: Median TSH levels during spring, summer, autumn, and winter were 1.88, 1.86, 1.87, and 1.96 µIU/L, respectively. TSH fluctuation was larger in winter (±0.128) than in summer (±0.125). After normalizing the data from each year to the lowest TSH median value (summer), TSH appeared to peak in winter and trough in summer, showing a negative correlation with temperature. Pearson correlation analysis indicated that the monthly median TSH values were negatively correlated with temperature (r = −0.663, p < 0.001).
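As a reminder of how such a coefficient is obtained, the following minimal sketch computes a Pearson correlation between monthly temperature and monthly median TSH. The two lists are invented placeholder values chosen only to show the shape of the calculation; they are not the study data and will not reproduce r = −0.663. The function name `pearson_r` is likewise just an illustrative choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Placeholder values for twelve months (NOT taken from the study).
monthly_temp_c = [-3, 0, 6, 14, 20, 24, 26, 25, 20, 13, 5, -1]
monthly_median_tsh = [1.96, 1.95, 1.90, 1.88, 1.87, 1.86,
                      1.85, 1.86, 1.87, 1.88, 1.92, 1.96]

r = pearson_r(monthly_temp_c, monthly_median_tsh)
print(f"r = {r:.3f}")  # negative: TSH is lower in warmer months in this toy data
```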

Conclusions: This study showed significant seasonal- and temperature-dependent variation in TSH concentrations. Thus, these might be important factors to consider when diagnosing thyroid function disorders.

Pages: 12 - 12

Using Blockchain for Verifying GDPR Rules in Cloud Ecosystems

Masoud Barati

Understanding how cloud providers support the European General Data Protection Regulation (GDPR) remains a main challenge for new providers emerging on the market. GDPR influences access to, storage, processing and transmission of data, requiring these operations to be exposed to a user to seek explicit consent. A privacy-aware cloud architecture is proposed that improves transparency and enables the audit trail of providers who accessed the user data to be recorded. The architecture not only supports GDPR compliance by imposing several data protection requirements on cloud providers, but also benefits from a blockchain network that securely stores the providers’ operations on the user data. A blockchain-based tracking approach based on a shared privacy agreement implemented as a smart contract is described; providers who violate GDPR rules are automatically reported.

Pages: 13 - 14

Establishing thresholds and effects of gender, age, and season for thyroglobulin and thyroid peroxidase antibodies by mining real-world big data

Ma Chao

Background: Thyroglobulin antibody (TG-Ab) and thyroid peroxidase antibody (TPO-Ab) are cornerstone biomarkers for autoimmune thyroid diseases, and establishment of appropriate thresholds is crucial for physicians to appropriately interpret test results. Therefore, we established the thresholds of TG-Ab and TPO-Ab in the Chinese population through analysis of real-world big data, and explored the influence of age, gender, and seasonal factors on their levels.

Methods: The data of 35,869 subjects downloaded from electronic health records were analyzed after filtering based on exclusion criteria and outliers. The influence of each factor on antibody levels was analyzed by stratification. Thresholds of TG-Ab and TPO-Ab were established through Clinical Laboratory Standards Institute document C28-A3 and National Academy of Clinical Biochemistry (NACB) guidelines, respectively.

Results: There were significant differences according to gender after age stratification; the level of TG-Ab gradually increased with age in females. There were significant differences in TG-Ab and TPO-Ab distributions with respect to age after gender stratification. Moreover, differences were observed between seasons for TG-Ab and TPO-Ab. The thresholds of TG-Ab and TPO-Ab were 107 [90% confidence interval (CI): 101–115] IU/mL and 29 (90% CI: 28–30) IU/mL, respectively, using C28-A3 guidelines, but were 84 (90% CI: 50–126) IU/mL and 29 (90% CI: 27–34) IU/mL, respectively, using NACB guidelines.

Conclusion: The levels of TG-Ab and TPO-Ab were significantly affected by gender, age, and season. The thresholds for TG-Ab and TPO-Ab for the Chinese population were established by big data analysis.

Pages: 15 - 16

Cumbersome task: data science in the old industry

Katharina Glass

About 3 years ago, my boss decided that it’s time to leverage the superpowers of data. So, I was the first data scientist, a unicorn, amongst 6600 colleagues at Aurubis. The primary task was to introduce, explain, promote and establish a data science skillset within the organization. Old industries, like metallurgy and mining, are not the typical examples of successful digital transformation because the related business models are extremely stable, even in the era of hyper-innovation. At least this is what some people believe, and it’s partly true, because for some branches there is no burning platform for digitization, and hence the change process is inert. Data science is the fundamental component of digital transformation. Our contribution to the change has a huge impact because we can extract value from the data and generate business value, to show people what can be done when the data is there and valid.

I learned that the most valuable, essential skills to succeed in our business are not necessarily programming and statistics. We all have the best training in data science methods. The two must-have skills are resilience and communication. Whenever you start something new, you will fail. You must be and stay resilient to rise strongly. Moreover, in the business world, the ability to communicate, that is, to tell data-based stories, to visualize them and to promote them, is crucial. As a data scientist you can only be as good as your communication skills are, since you need to persuade others to make decisions or help to build products based on your analyses. Finally, dare to start simple. When you introduce data science in the industry, you start on the brown field. Simple use cases and projects like metrics, dashboards, reports and historical analysis help you to understand the business model and to assess where your contribution to the success of the company lies. This is the key to data science success, not only in the multimetal industry but everywhere else as well.

Pages: 17 - 17

AI-based data analysis for text classification and document summarization

Yuefeng Li

Over the years, businesses have collected very large and complex big data collections, and it has become increasingly difficult to process these big data using traditional techniques. This is a major challenge, since the majority of big data is unlabelled and unstructured (that is, information that is not pre-defined). Recently, AI (Artificial Intelligence) based techniques have been used to address this issue, e.g., understanding a firm’s reputation using on-line customer reviews, or retrieving training samples from unlabelled tweets. This talk discusses how AI techniques contribute to text classification and document summarization when only limited user feedback on relevance is available. It first discusses the principle of a new classification methodology, “a three-way decision based binary classification”, to understand the hard issue of dealing with the uncertain boundary between the positive class and the negative class. It also extends the application of three-way decisions from text classification to document summarization and sentiment analysis. This talk presents some new experimental results on several popular data collections, such as RCV1, Reuters-21578, Tweets2011 and Tweets2013, DUC 2006 and 2007, and Amazon review data collections. It also discusses many advanced techniques for obtaining more knowledge from big data about relevance, in order to help people create effective machine learning systems for processing big data, and several open issues regarding AI-based data analysis for text, Web and media data.
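The general idea of a three-way decision can be sketched as follows: instead of forcing every document into the positive or negative class, items whose relevance score falls between two thresholds are placed in a boundary region for further treatment. This is a generic illustration of three-way decisions, not the speaker's method; the function name and the threshold values `alpha = 0.7` and `beta = 0.3` are assumptions made for the example.

```python
def three_way_decide(score: float, alpha: float = 0.7, beta: float = 0.3) -> str:
    """Assign a relevance score in [0, 1] to one of three regions.

    alpha and beta are hypothetical thresholds with beta < alpha:
    scores >= alpha are accepted, scores <= beta are rejected, and
    everything in between is deferred to the uncertain boundary.
    """
    if not beta < alpha:
        raise ValueError("beta must be smaller than alpha")
    if score >= alpha:
        return "positive"
    if score <= beta:
        return "negative"
    return "boundary"


# Hypothetical relevance scores for four documents.
scores = {"doc1": 0.92, "doc2": 0.55, "doc3": 0.10, "doc4": 0.70}
regions = {name: three_way_decide(s) for name, s in scores.items()}
print(regions)
```

Documents in the boundary region are the ones where additional evidence, such as further user feedback, would be most useful before a final binary label is assigned.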

Pages: 19 - 20

Automated classification of a tropical landscape infested by Parthenium weed (Parthenium hysterophorus)

Zolo Kiala

The invasive Parthenium weed (Parthenium hysterophorus) adversely affects animal and human health, agricultural productivity, rural livelihoods, local and national economies, and the environment. Its fast spreading capability requires consistent monitoring for adoption of relevant mitigation approaches, potentially through remote sensing. To date, studies that have endeavoured to map the Parthenium weed have commonly used popular classification algorithms, including support vector machines and random forest classifiers, which do not capture the complex structural characteristics of the weed. Furthermore, determining site- or data-specific algorithms, usually achieved through intensive comparison of algorithms, is often laborious and time consuming. Also, selected algorithms may not be optimal on datasets collected at other sites. Hence, this study adopted the Tree-based Pipeline Optimization Tool (TPOT), an automated machine learning approach that can be used to overcome high data variability during the classification process. Using Sentinel-2 and Landsat 8 imagery to map Parthenium weed, we compared the outcome of TPOT to the best performing and optimized algorithm selected from sixteen classifiers on different training datasets. Results showed that the TPOT model yielded a higher overall classification accuracy (88.15% using Sentinel-2 and 74% using Landsat 8), accuracies that were higher than those of the commonly used robust classifiers. This study is the first to demonstrate the value of TPOT in mapping Parthenium weed infestations using satellite imagery. Its adoption would therefore be useful in limiting human intervention while optimising classification accuracies for mapping invasive plants. Based on these findings, we propose TPOT as an efficient method for selecting and tuning algorithms for Parthenium discrimination and monitoring, and indeed for general vegetation mapping.

Pages: 21 - 21

Two New Algorithms, Critical Distance Clustering and Gravity Center Clustering

Farag Hamed Kuwil

We developed a new algorithm based on the Euclidean distance among data points, employing some mathematical statistics operations, and called it the critical distance clustering (CDC) algorithm (Kuwil, Shaar, Ercan Topcu, & Murtagh, Expert Syst. Appl., 129 (2019) 296–310. https://authors.elsevier.com/a/1YwCc3PiGTBULo). CDC works without the need to specify parameters a priori, handles outliers properly and provides thorough indicators for clustering validation. Improving on CDC, we are on the verge of building second-generation algorithms that are able to handle datasets with larger numbers of objects and dimensions.

Our new, unpublished Gravity Center Clustering (GCC) algorithm falls under partition clustering and is based on the gravity center (GC), a point within a cluster. It verifies both connectivity and coherence in determining the affiliation of each point in the dataset and can therefore deal with data of any shape. Lambda is used to determine the threshold and identify the required similarity inside clusters using Euclidean distance. Moreover, two coefficients, lambda and n, give the observer some flexibility to control the results dynamically, where n represents the minimum number of points in each cluster and lambda is used to increase or decrease the number of clusters. (Parameters and coefficients are different: in this study, we regard parameters that an algorithm requires in order to run as a disadvantage or challenge, but coefficients that can be adjusted to obtain better results as an advantage.) Thus, lambda and n are changed from their default values when addressing challenges such as outliers or overlapping.

Pages: 6 - 6

Surge-Adjusted Forecasting in Temporal Data Containing Extreme Observations

Smaranya Dey, Subhadip Paul, Uddipto Dutta and Anirban Chatterjee

Pages: 8 - 9

A Big Data Knowledge Computing Platform for Intelligence Studies

Wen Yi

Pages: 18 - 18

Discovering the Dropout Situations Using Statistical and Machine Learning Models

Mahboobeh Zohourian, Marzieh Shekari, Hossein Zamani and Moftakhar Ahmadi
