Proposal for Establishing

“PSU Professional Bi-Lingual Data & Text Mining Hub”

 

“Will Provide Guidance and Results for those who are Data-rich, yet Information-poor”                                                                                    

 

 

Preamble

 

The PSU Professional Bi-Lingual Data & Text Mining Hub will serve stakeholders, functional managers and business practitioners in business, industry, government and academia, who have made substantial investments in data collection (English or Arabic), storage, retrieval, visualization and basic analysis but may not have the technical or strategic experience necessary to chart an effective roadmap to uncover the valuable predictive insights hidden within their existing data and/or text.  The Hub will provide:

·         How and where to get started

·         Why failure to implement is so common, and why pitfalls are so avoidable

·         Case studies that reveal the rewards of proper design and implementation

·         Why establishing an internal predictive modeling practice is within one’s reach

·         Tips, tricks and techniques for data preparation and method selection

·         Live participant polls and an interactive guru session with the experts

·         Resources and direction on how to move forward with confidence

·         Applied research, consulting, and solutions

·         Free Webinars, training, and business analytics

·         Meetings, conferences, forums, and data repositories

·         FAQ, potential partners, and student activities

·         Arabic & English Digital Libraries

·          Membership of Social Networks & the Hub

·         Start up projects and links to other universities offering degrees in “Data Mining”

·         And more...

 

Vision

 

 

To establish the first Bi-lingual Mining Hub in the Middle East for both applied structured and unstructured data and text mining, and to advance Arabic Mining Research.

 

 

 Mission

 

  The Hub will provide a research environment in which top researchers can assess and solve Saudi national bi-lingual data and text mining problems and develop procedures and services. It intends to transfer knowledge and expertise (know-how) by holding public events, demonstrating specialized case studies, and collaborating with other colleges and universities in Saudi Arabia as well as with international research groups, universities, and companies.

 

 

 

 

 

Objectives

 

 

How will this Hub differ from others?

 

The Hub will focus on two main areas of mining: structured data mining and unstructured text mining. For structured data mining, the Hub has two novel objectives. The first is to make the data and text mining process reliable and repeatable for people with limited data mining skills by developing a “know-how” knowledge-based system in this field. A newly proposed vertical stacking of data mining, statistical, and visualization algorithms will provide a uniform framework for user guidance and for recording experience (know-how). Stacking will provide a flexible blend of algorithms that accounts for different business or agency problems as well as different data types.

 

The second objective is to improve the quantity and quality of Arabic content in the area of “Data and Text Mining” on the Web. All material published from the Hub’s activities will be translated and reviewed by its author(s) and made available in an Arabic Digital Library. A systematic plan will be developed to translate many “data mining” articles and store them in a searchable Arabic Digital Library. Text and multimedia mining tools will be used to explore the contents of this Arabic digital library and expose related and correlated paragraphs and sections, with the aim of developing new Arabic text mining algorithms and enhancing existing ones. This leads to the Hub’s other area of focus, unstructured text mining.

 

For unstructured text mining, an English Data Mining digital library with the same contents will be developed in parallel with the Arabic digital library. Both libraries will have a traditional search engine alongside more elaborate classification and categorization capabilities. In addition, text and multimedia mining tools will be used to explore the contents of the two digital libraries and expose related and correlated paragraphs and sections. Text mining is used to find interesting regularities in large textual digital libraries, where “interesting” means non-trivial, hidden, previously unknown, and potentially useful. Both Arabic and English text mining tools handle digital library text at the word level, sentence level, document level, document-collection level, linked-document-collection level, and application level. Most text mining methods rely on the fact that documents usually contain highly redundant data. Most of the tools make use of document summarization techniques, single-document graph visualization algorithms, segmentation algorithms, feature selection algorithms, similarity algorithms, clustering, and information extraction techniques.

They also make use of several visualization techniques such as: WebSOM, ThemeScape, Graph-Based visualization techniques, and Tiling-based visualization techniques.

 

Statistical tools for text mining include: RapidMiner (formerly YALE) word vector mining, UIMA by IBM, GATE, the AeroText suite, Attensity, Endeca Technologies, Inxight, and LanguageWare.

 

As for “Data Mining”, we propose the same vertical stacking of text mining, statistical, and visualization algorithms for performing text mining on both the English and the Arabic data mining digital libraries. This will provide an interesting context for researchers in the “Text mining” and “Arabization” fields to investigate how to improve Arabic text mining algorithms, using the English ones as a cross-reference. A very interesting research direction can be developed here. For example, the same mining questions can be posed to both the English and the Arabic digital libraries and the results compared. Where the results differ, learning opportunities arise, and modifications and enhancements to the algorithms can be investigated. The two libraries will provide several ways and means of verification, validation, and cross-checking.
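As a minimal sketch of posing the same mining question to both libraries and comparing the results, the Python fragment below ranks documents by cosine similarity of simple word-count vectors and then measures how much the two result sets overlap. Everything in it is an assumption made for illustration: the tiny parallel libraries, the shared document identifiers (d1 to d3), the function names, and the whitespace tokenization, which is far too crude for real Arabic text and stands in for whatever tokenizer and stemmer the Hub would actually adopt.

```python
import math
from collections import Counter


def tf_vector(text):
    """Word-count vector using naive whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_matches(library, query, k=2):
    """Return the ids of the k documents most similar to the query."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in library.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def overlap(ids_a, ids_b):
    """Jaccard overlap between two result sets (1.0 means identical sets)."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical parallel libraries: the same document id holds the same
# article in English and in Arabic.
english = {
    "d1": "clustering groups similar documents",
    "d2": "association rules for market basket analysis",
    "d3": "document clustering with text mining",
}
arabic = {
    "d1": "تجميع المستندات المتشابهة",
    "d2": "قواعد الارتباط لتحليل سلة السوق",
    "d3": "تجميع المستندات باستخدام تنقيب النصوص",
}

en_hits = top_matches(english, "document clustering")
ar_hits = top_matches(arabic, "تجميع المستندات")
print("English:", en_hits)
print("Arabic: ", ar_hits)
print("Overlap:", overlap(en_hits, ar_hits))
```

In this toy case both libraries return the same two documents but rank them differently, because the English query term "document" does not match the plural "documents" without stemming. Differences of this kind are the learning opportunities described above.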

 

 Hub Activities  

 

 

Present-day data mining tools are powerful but require significant expertise to implement effectively, so there is a clear need for a “Know-How” knowledge base in this field. The distinctive feature of the proposed Hub is the vertical stacking (integration) of traditional data mining methods with both statistical methods and visualization techniques. We believe that the iterative and interactive application of blends of algorithms and techniques from these three areas will provide better insight and capability in analysing available data. Early decisions on models and techniques are not recommended: exploring and navigating the available data without such early commitment leads to better exploration of the solution space, and the subsequent convergence on, and exploitation of, particular models and techniques will be better justified and more logical. Data mining algorithms and statistical analysis complement each other, since data mining tools are mainly about “hypothesis generation” whereas statistical analysis is mainly about “hypothesis testing”. The iterative and interactive application of compatible techniques and methods from both fields will facilitate cross-checking, verification, validation and cross-validation, and will thus be most beneficial for the “Know-how” knowledge base.

 

For example, with respect to “Models and Patterns”, there are three known kinds of model: prediction models (e.g. regression), probability distribution models (e.g. parametric, Markov), and structured data models (e.g. time series, and transition distributions such as hidden Markov and spatial models). For all of these models there are both data mining algorithms and statistical analysis algorithms that generate and test them.

 

As for “patterns”, we have two types: Global patterns (e.g. clustering), and local patterns (e.g. Outlier detection, Bump hunting, scan statistics, and association rules). Again in these patterns, we have both data mining algorithms as well as statistical analysis algorithms that both generate and test these patterns.
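The pairing of hypothesis generation with hypothesis testing can be illustrated with a small Python sketch. It is an assumption-laden toy, not the Hub’s proposed stacking framework: the records, the attribute names (smoker, tea, bp), and both function names are invented for illustration. A mining-style step picks the binary attribute that best separates the target, and a statistical step (a permutation test) then checks whether that separation could plausibly be due to chance.

```python
import random
from statistics import mean


def best_split(records, target, candidates):
    """Hypothesis generation: pick the binary attribute whose two groups
    differ most in mean target value."""
    def gap(attr):
        yes = [r[target] for r in records if r[attr]]
        no = [r[target] for r in records if not r[attr]]
        return abs(mean(yes) - mean(no))
    return max(candidates, key=gap)


def permutation_p_value(records, target, attr, rounds=2000, seed=0):
    """Hypothesis testing: the fraction of random relabellings whose gap is
    at least as large as the observed one."""
    rng = random.Random(seed)
    values = [r[target] for r in records]
    flags = [bool(r[attr]) for r in records]

    def gap(labels):
        yes = [v for v, f in zip(values, labels) if f]
        no = [v for v, f in zip(values, labels) if not f]
        return abs(mean(yes) - mean(no))

    observed = gap(flags)
    hits = 0
    for _ in range(rounds):
        shuffled = flags[:]
        rng.shuffle(shuffled)  # group sizes stay the same, labels move
        if gap(shuffled) >= observed:
            hits += 1
    return hits / rounds


# Invented data: blood pressure (bp) with two candidate binary attributes.
records = [
    {"smoker": True, "tea": True, "bp": 150},
    {"smoker": True, "tea": False, "bp": 155},
    {"smoker": True, "tea": True, "bp": 148},
    {"smoker": True, "tea": False, "bp": 160},
    {"smoker": True, "tea": True, "bp": 152},
    {"smoker": False, "tea": False, "bp": 120},
    {"smoker": False, "tea": True, "bp": 118},
    {"smoker": False, "tea": False, "bp": 125},
    {"smoker": False, "tea": True, "bp": 122},
    {"smoker": False, "tea": False, "bp": 119},
]

attribute = best_split(records, "bp", ["smoker", "tea"])
p_value = permutation_p_value(records, "bp", attribute)
print(f"generated hypothesis: '{attribute}' separates bp; p-value = {p_value:.3f}")
```

A visualization step (for example, a box plot of the two groups) would be the third layer of the stack; it is omitted here to keep the sketch free of external libraries.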

 

In offering the following activities, the proposed Hub will make extensive use of blends of this vertical stacking (integration) of traditional data mining methods with both statistical methods and visualization techniques:

 

-Applied Research

-Developing a “Know-How” Knowledge Base

-Free Webinars

-Training & Offering Certificates and later Master Degrees

-Consulting

-Solutions

-Experience

-Business Analytics

-Business Intelligence

-Meetings

-Funding of Research Projects

-Data Repositories

-Forums

-FAQ

-CFP

-International Conference

-Partners & Collaborations

-Students Activities

-Developing English & Arabic Digital Libraries in the areas of Data and Text Mining

-Membership of Social Networks

-Membership of the Hub

-Links to other Universities offering degrees in “Data Mining”

-Hub News

-Start Up Projects

 

Results of the above activities will provide the contexts within which the “Know-how” knowledge will be generated and added to the proposed “Know-how” knowledge based system.

 

The results of the above activities will also be translated into Arabic and stored in both the Arabic and English data mining Digital Libraries, each with a traditional search engine and with text and multimedia mining tools for deeper examination of the contents. This will be the first bi-lingual Arabic/English resource in this vital area of specialization. Arabic content currently lags far behind on the Web. We would like to focus our efforts on this specialized area and work hard to develop data mining Arabic/English Digital Libraries that can gain international recognition.

 

 

    Resources of the HUB

 

 

 

     The PSU Data Mining Hub will have several high-end servers: a licenses and domain web server; a classification and regression server; a clustering and segmentation server; a deviation, fraud detection and recommendation server; a statistical and association server; and a databases and digital library server. It will also have visualization and simulation tools as well as adequate software development and analysis tools for carrying out any data mining project. The Hub caters to research in the areas of Data Mining, On-Line Statistical Analysis, Knowledge-Based Systems, Business Intelligence, Decision Support, Data Warehousing, Knowledge Discovery, Predictive Analytics, Text Mining, Arabization and Data Modelling, with various case studies and proof-of-concept projects.

 

Most vendors provide broadly the same set of tools. There are also many embedded products (e.g. in fraud detection, health care, customer relationship management, etc.). Some of the suites and tools on the servers may be redundant, but we list them here for completeness. These resources can be built incrementally, starting with an essential core and adding to it.

 

Hardware and Software Resources


High-End Server

Licenses, Domain Home Server  
(3.0/2x Xeon/2GB RAM/60GB)

·            Licenses, Domain Home Server

High-End Server-Classification & Regression Server
(Linux/2xXeon/2 GB RAM/60GB)

·            Decision Tree Tools
 

·            Rule-Based tools

·            Neural Network Tools

·            Bayesian Tools

·            Support Vector Machines (SVM) Tools

·            Genetic Algorithms Tools

·            Nearest neighbour Tools

·            Time Series Analysis Tools

·            Multiple Approaches Tools

·             Visualization Tools

·            AdvancedMiner Suite

·            Angoss Studio Suite

·            BayesiaLab Suite

·            Clementine Suite

·            Data Applied Suite
 

·            IBM Data Explorer a visualization tool

·            Case Based Reasoning Tools 

·            GNU C compiler and other GNU tools

·            LabView Tools

·            Matlab Simulink & Toolboxes 
 

High-End Server-Clustering & Segmentation Server
(Linux /2x Xeon,/4 GB RAM/150GB)

·            Clustering Tools

·            Bayesian Networks Tools

·            Neural Network Clustering Tools

·            Summarization Tools

·            DBMiner Enterprise Suite

·            EWA Systems Suite

·            Exeura Rialto Suite

·            Fair, Isaac Business Science Suite

·            GhostMiner Suite

·            IBM Intelligent Miner Suite

·            Insightful Miner Suite

·            KnowledgeMiner Suite

·            KXEN Suite

·            MCubiX Diagnos Suite

·            MERKUR Miner Plus Suite

High-End Server –Deviation, Fraud Detection & Recommendation Server
(Linux-GNU/2x P-IV/4 GB RAM/20GB)

·            OLAP Tools

·            Anomalies Identification Tools

·            Deviation Detection Tools

·            -Case Based Reasoning Tools

·            Open Graph Tools

·            Recommendations Tools

·            -Content Recommendations Tools

·            SAAS Tools

·            -Real-Time Analytics Tools

·            Oracle Data Mining Suite

·            Polyanalyst Suite

·            Predictive Data Mining Suite

·            RapAnalyst Suite

·            Salford Systems Data Mining Suite

·            GNU Software Development Tools
 

High-End Server-Statistical & Associations Server
(Windows XP/2x P-IV-HT/ 4 GB RAM/80GB)

·            Regression Tools

·            SAS Tools

·            Decision Trees Tools

·            MatLab

·            R Tools

·            SPSS Tools

·            Statistics Tools

·            Market basket Analysis Tools

·            Association Discovery Tools

·            Genetic Algorithms Tools

·            SAS Enterprise Miner Suite

·            SPAD Suite

·            SPSS Suite

·            Statistica Data Miner Suite

·            Synapse Suite

·            TIBCO Spotfire Miner Suite

·            Viscovery Data Mining Suite

·            Witness Miner Suite

·            Xpertrule Miner Suite

High-End Server- Databases and Digital Library Server

(Windows XP/2x P-IV-HT/ 4 GB RAM/80GB)

·            Digital Library Tools

·            Text mining Tools

·            Multi-media mining Tools

·            Oracle DBMS

·            DB2 DBMS

·            SQL DBMS

·            MySQL DBMS

·            Sensor Databases

·            P2P Databases

·            Embedded Systems Databases

·            OO DBMS

·            Data Warehousing Tools

Clients (10 or more)
(Win XP/P-IV/512 MB RAM/40GB)

·            Digital Library Tools

·            Text Mining Tools

·            Multi-media mining Tools

·            Data Preparation Tools
 

·            Data Cleaning Tools

·            Visualization Tools

·            Summarization Tools

·            Data Transformation Tools

·            Reporting Tools

·            Application Development Tools

 

Open Source  Tools

·            Weka Suite

·            Rattle Suite

·            Rapid Miner Suite

·            Orange Suite

·            Mining Mart Suite

·            IBM Intelligent Miner Suite

·            ADaM Development Suite

·            Alpha Miner Suite

Other Software Tools

·            Visualization Tools

·            Sequential Analysis Tools

·            Collaborative Filtering Tools


 

 

 

 

 

 

 

[Figure placeholder: diagram showing the structure of the servers and clients in the Hub]

 Target Audience 

 

Personnel with domain, database, or statistical expertise, such as:

IT and IS EXECUTIVES AND MANAGERS: CIOs, CKOs, CTOs, Stakeholders, Functional Officers, Technical Directors and Project Managers

LINE-OF-BUSINESS EXECUTIVES AND FUNCTIONAL MANAGERS: Risk Managers, Customer Relationship Managers, Business Forecasters, Inventory Flow Analysts, Financial Forecasters, Direct Marketing Analysts, Medical Diagnostic Analysts, eCommerce Company Executives

TECHNOLOGY PLANNERS: Who survey emerging technologies in order to prioritize corporate investment

CONSULTANTS: Whose competitive environment is intensifying and whose success requires competency with data mining and related emerging information technologies

RESEARCHERS: Who are interested in both structured bi-lingual data mining algorithms and their applications, as well as unstructured bi-lingual text mining algorithms.

 

 

The “Know-How” Knowledge Based System

All the Hub’s activities will provide learning opportunities for developing and growing (acquiring) a “Know-how” knowledge base. The suggested system can be built using any of the following techniques: rule-based techniques, inductive techniques, symbol manipulation techniques, case-based techniques, and/or qualitative techniques (such as model-based reasoning, temporal reasoning, and neural networks). It will have an inference engine that reasons about and searches for (composes) solutions. It will also have a learning module (knowledge acquisition) that allows the system to learn and improve its performance through exposure to learning opportunities (contexts and decisions), and an explanation module that answers reasoning questions such as: Why, What if, What is, How, and Why not.

The “Know-how” knowledge base is a sort of a decision support system for those who may not have the technical or strategic experience necessary to chart an effective roadmap to uncover the valuable predictive insights hidden within their existing data.  The “know-how” knowledge base will provide:

·            How and where to get started with a specific data set

·            Why the straightforward application of a specific tool can fail, and how pitfalls can be avoided

·            Relevant Case studies that reveal the rewards of proper algorithm selection, proper design and careful implementation when dealing with a specific data set

·            Why establishing an internal predictive modeling practice is within one’s reach – will also establish a roadmap for a specific data set

·            Tips, tricks and techniques for a specific data set preparation, method selection, validation methods, and gluing of appropriate data mining, statistical, and visualization methods (making use of the stacking approach)

·            Interactive guru session with explanations

·            Resources (meta-knowledge) and direction on how to move forward with confidence
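To make the three modules concrete, the following Python sketch shows one possible rule-based shape for such a system: a learning step that records rules, an inference step that fires them against facts about a data set, and an explanation attached to each recommendation. It is only an illustration under stated assumptions. The class and field names, the fact keys (missing_ratio, has_target, target_type), the 20% threshold, and the three rules themselves are hypothetical and are not taken from any existing tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rule:
    name: str
    applies: Callable[[Dict], bool]  # condition over facts about the data set
    advice: str                      # what the system recommends
    because: str                     # answer given to a "Why?" question


class KnowHowBase:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def learn(self, rule: Rule) -> None:
        """Learning module: record a new piece of know-how."""
        self.rules.append(rule)

    def advise(self, facts: Dict) -> List[Tuple[str, str]]:
        """Inference engine: return (advice, explanation) for each rule that fires."""
        return [(r.advice, r.because) for r in self.rules if r.applies(facts)]


kb = KnowHowBase()
kb.learn(Rule(
    "clean-first",
    lambda f: f.get("missing_ratio", 0) > 0.2,  # hypothetical threshold
    "Clean or impute missing values before modelling",
    "more than 20% of the values are missing",
))
kb.learn(Rule(
    "no-target",
    lambda f: not f.get("has_target", False),
    "Start with clustering",
    "the data set has no labelled target attribute",
))
kb.learn(Rule(
    "categorical-target",
    lambda f: bool(f.get("has_target")) and f.get("target_type") == "categorical",
    "Start with a decision tree classifier",
    "the target attribute is categorical",
))

plan = kb.advise({"has_target": False, "missing_ratio": 0.35})
for advice, why in plan:
    print(f"{advice} (why: {why})")
```

A case-based or inductive variant would replace the hand-written rules with rules derived from recorded project outcomes; the interface of learning, advising, and explaining could stay the same.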

 

 

 

 Certificate 

 

The PSU Hub will offer a data mining certificate similar to the following international certificates:

-University of California San Diego- Data mining certificate

-University of Connecticut- Data Mining online Master Degree and Graduate Certificate

-Stanford University- Data Mining and Applications graduate certificate

-SAS and Oklahoma State University Data Mining certificate (a joint certificate with SAS)

-University of Louisville graduate certificate in Data Mining

-New Jersey’s Science & Technology University certificate in Data Mining.

 

PSU will seek to offer certification in Data Mining in collaboration with either SAS Saudi Arabia or IBM Saudi Arabia. SAS Saudi Arabia currently offers certificates in Statistics with both King Fahd University of Petroleum and Minerals and King Saud University. PSU can start the first “Data Mining” certificate with SAS. IBM Saudi Arabia currently provides two Data Mining solutions, “Cognos” and “DB2”, and PSU can offer a joint Data Mining certificate with IBM using these two packages.

The requirements for any of these certifications can be:

·            12 credit hours

·            1 year to complete (on average)

·            Graduate Certificate may qualify for Financial Assistance

What are the required courses?

·            Data Management System Design

·            Data Mining and Management

o              Select two(2) from:

·            JAVA Programming

·            Advanced Database Systems

·            Information Retrieval

·            Knowledge Based Systems

Courses can be offered both in class and on-line. Also see “Learning Outcomes” of the Certificate in the appendix.

 Masters Graduate Degree

 

Completing the Certificate is equivalent to obtaining a PSU Graduate Diploma. The Diploma qualifies students to enter the PSU Master’s program in Information Systems or Computer Science. The PSU Master’s degree can be built incrementally, starting with both the “Netversity” Certificate Diploma holders and the SAS-PSU or IBM-PSU Data Mining Diploma holders.

 

PSU can also establish a link with similar programs, such as the SAS and Oklahoma State University (OSU) program, and arrange for PSU graduates to continue their postgraduate degrees at OSU (courses at PSU and research theses at OSU).

 

Applied Research

 

 

 

 

The following are some Application areas that the Hub will focus upon (Suitable for Saudi Arabia):

 

-Finance: Credit Card Analysis

-Insurance: Claims, Fraud Analysis

-Telecommunication: Call Record Analysis

-Transport: Logistics management

-Consumer Goods: Promotion Analysis

-Data Service Providers: Value Added Data

-Utilities: Power Usage Analysis

-Medicine: Effectiveness of treatments & relationship between diseases

 

The following diagram shows the percentages of “data Mining” sectors in each area worldwide.

 

 

[Figure: worldwide share of data mining applications by sector]

Here are some examples of “Data Mining” Applications (questions) in the above sectors:

 

·            Performing basket analysis

o              Which items do customers tend to purchase together? This knowledge can improve stocking, store layout strategies, and promotions.

·            Sales forecasting

o              Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?

·            Database marketing

o              Retailers can develop profiles of customers with certain behaviors, for example those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions. Data mining can also recognize sales patterns among outlets and so identify trends and shifts in customer taste.

·            Merchandise planning and allocation

o              When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.

·            Card marketing

o              By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.

·            Cardholder pricing and profitability

o              Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. This includes risk-based pricing.

·            Fraud detection

o              Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns. They can isolate the factors that lead to fraud, waste and abuse, and they can target auditing and investigative efforts more effectively. Fraudulent transactions can be identified by modeling each credit card holder’s requested transactions against the customer’s past spending history. Health insurance fraud can be identified by mining insurance claims, including laboratory tests, to find unneeded expensive tests. Information on new card applications can be checked against data from credit bureaus so that suspect new accounts can be stopped. Fraudulent events can also be identified from an online stream of events.

·            Predictive life-cycle management

o              Data mining (DM) helps banks predict each customer’s lifetime value and service each segment appropriately (for example, by offering special deals and discounts).

·            Call detail record analysis

o              Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.

·            Customer loyalty

o              Some customers repeatedly switch providers, or “churn”, to take advantage of attractive incentives offered by competing companies. Companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, enabling them to target their spending on the customers who will produce the most profit.

·            Customer segmentation

o              All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis. This can be used to recruit and attract customers, to identify profitable customers, and to build profiles of which customers are likely to use which services.

·            Manufacturing

o              Through choice boards, manufacturers are beginning to customize products for customers; they must therefore be able to predict which features should be bundled to meet customer demand. In quality control, predictive models can be built for the effects of production parameters on the performance of the items produced.

·            Warranties

o              Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims. Part failures can also be predicted.

·            Frequent flier incentives

o              Airlines can identify groups of customers that can be given incentives to fly more.

·            Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

·            Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

·            If I raise the price of my product by Rs. 2, what is the effect on my ROI?

·            If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

·            If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

·            Which of my customers are likely to be the most loyal?

·            Forecasting what may happen in the future

·            Classifying people or things into groups by recognizing patterns

·            Clustering people or things into groups based on their attributes

·            Associating what events are likely to occur together

·            Sequencing what events are likely to lead to later events, e.g. credit/risk scoring and intrusion detection

1.          Health Care: What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male, regular smoker, low to moderate salt consumption?

2.          Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?

3.          If you are a 2% milk drinker, how many factors are still interesting?

4.          Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?

5.          Grow an automatic tree and check whether gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.
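The basket-analysis question listed among the applications above (which items do customers tend to purchase together?) is usually answered with association rules scored by support, confidence, and lift. The Python sketch below computes these three measures for item pairs only. It is a simplified illustration under stated assumptions: the baskets are invented, the function name and the 0.4 support threshold are arbitrary choices, and a real project would use a full frequent-itemset algorithm from one of the suites listed earlier rather than this pairwise count.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4):
    """Return rules (x, y, support, confidence, lift) for item pairs whose
    support is at least min_support.

    support    = share of baskets containing both x and y
    confidence = share of baskets containing x that also contain y
    lift       = confidence divided by the overall share of baskets containing y
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    return rules


# Invented transactions for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "dates", "milk"],
    ["dates", "coffee"],
    ["bread", "milk", "coffee"],
    ["dates", "coffee", "milk"],
]

rules = pair_rules(baskets)
for x, y, support, confidence, lift in sorted(rules, key=lambda r: -r[4]):
    print(f"{x} -> {y}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than their individual popularity would predict, which is the signal a retailer would use for stocking, layout, or promotion decisions.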

 

 

 

 

 

 

 

 

Funding of Projects

 

The PSU Hub will sponsor research funds and grants for outstanding research projects that can create value for the Kingdom of Saudi Arabia and for society. These research grants will be awarded for high-level and promising Data Mining research projects by individuals or groups from academia and/or industry who are actively involved in research and development. Projects should be based on either a universally known technology or a new technology developed by the applicant, and should aim at viable systems, algorithms, or processes beneficial to the nation. Grants will be provided for a period of 3 or 6 months, depending on the scope of the project.

Faculty members and research staff of all faculties of PSU, as well as of all universities in the Kingdom of Saudi Arabia, can apply for this research grant, provided that the proposed project addresses some aspect of Data Mining. The principal investigator should hold a doctorate and have a solid background in the proposed research, with publications in well-reputed national and international journals and conferences. Projects pursued jointly with industrial, government, and commercial organizations will be given higher priority. All submitted proposals will be peer-reviewed by a panel of experts and judged on their originality, novelty, relevance, significance, and quality.

•    Solving a national problem
•    Developing a specific algorithm, process or patent
•    Investigating data mining in the scope of the center’s research areas (provided in the application form)
•    Short-period time frame

Research Type        Total    1st Deposit   2nd Deposit   At Publication   No. of Papers

Small (3 Months)     30000    10000         10000         10000            1

Medium (6 Months)    60000    20000         20000         20000            2

 

 

 

 

English & Arabic Data Mining Digital Libraries 

 


 

One of the main objectives of the Hub is to increase the quantity and quality of Arabic articles in the area of “Data Mining” on the Web. All material published from the Hub’s activities will be translated and reviewed by its author(s) so that it is also available in an Arabic Digital Library. A systematic plan will be developed to translate many “data mining” articles and store them in a searchable Arabic Digital Library. In parallel, an English Data Mining digital library will be developed. Both libraries will have a traditional search engine alongside more elaborate text mining capabilities.

 

Text and multimedia mining tools will be used to explore the contents of the digital libraries and expose related and correlated paragraphs and sections. Text mining is used to find interesting regularities in large textual digital libraries, where “interesting” means non-trivial, hidden, previously unknown, and potentially useful. Text mining tools handle digital library text at the word level, sentence level, document level, document-collection level, linked-document-collection level, and application level. Most text mining methods rely on the fact that documents usually contain highly redundant data. Most of the tools make use of document summarization techniques, single-document graph visualization algorithms, segmentation algorithms, feature selection algorithms, similarity algorithms, clustering, and information extraction techniques. They also make use of several visualization techniques such as WebSOM, ThemeScape, graph-based visualization techniques, and tiling-based visualization techniques.

 

Statistical tools for text mining include: YALE/RapidMiner word vector mining, UIMA by IBM, GATE, the AeroText suite, Attensity, Endeca Technologies, Inxight, and LanguageWare.

 

As with “Data Mining”, we also provide the same vertical stacking of text mining, statistical, and visualization algorithms. Performing text mining on both the English and the Arabic digital libraries will provide an interesting context for researchers in “Text mining” and “Arabization” to investigate how to improve Arabic text mining algorithms, using the English ones as a cross reference. A very interesting research direction can be developed there. For example, the same mining questions can be posed to both the English and the Arabic digital libraries and the results compared. Where the results differ, learning opportunities arise, and modifications and enhancements to the algorithms are to be investigated. The two libraries will provide several ways and means for verification, validation, and cross checking. This will serve as an experimental testbed or sandbox for testing and experimentation.
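
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the two libraries are assumed to hold the same documents under shared identifiers, the search is a toy substring match, and all names and sample texts are hypothetical.

```python
def search(library, term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in library.items() if term in text}


def compare_results(english_hits, arabic_hits):
    """Summarise how far the two result sets agree (Jaccard overlap)."""
    union = english_hits | arabic_hits
    overlap = len(english_hits & arabic_hits) / len(union) if union else 1.0
    return {
        "jaccard": overlap,
        "english_only": sorted(english_hits - arabic_hits),
        "arabic_only": sorted(arabic_hits - english_hits),
    }


# Made-up parallel libraries: the same document ids in both languages.
english_library = {
    "doc1": "clustering of customer data",
    "doc2": "text clustering methods",
    "doc3": "decision trees",
}
arabic_library = {
    "doc1": "تجميع بيانات العملاء",
    "doc2": "طرق تصنيف النصوص",
    "doc3": "أشجار القرار",
}

report = compare_results(
    search(english_library, "clustering"),
    search(arabic_library, "تجميع"),
)
print(report)
```

Documents listed under english_only or arabic_only are the cases of difference: here doc2 is found only in English because its Arabic version uses a different term, which is the kind of discrepancy that would prompt a closer look at the translation or at the Arabic algorithm.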

 

 

 

Free Webinars

 

 

The PSU Hub will schedule Web broadcasts of its meetings and conferences. Other webcasts will be announced and followed, such as:

·            SAS Webcasts on Analytics, Predictive Modeling and Data Mining.

·            Predictive Analytics Applied - Immediate access, on-demand
These are self-paced online courses that cover predictive analytics applications, core technology, and management. Detailed case studies and software demos are included.

 

Training

 

In addition to the Certificate and Diploma in Data Mining, custom-made courses can be arranged by the Hub for delivery to specific target audiences. It is important to know that Data Mining is not:

·            Brute-force crunching of bulk data

·            “Blind” application of algorithms

·            Going to find relationships where none exist

·            Presenting data in different ways

·            A database-intensive task

·            A difficult-to-understand technology requiring an advanced degree in computer science

 

Experience gained from training will be documented in the “Know-How” knowledge base and added to the target knowledge base.

 

Consulting

 

 

The Hub will provide several consulting services, such as developing strategic plans for data mining, reviewing and evaluating the feasibility of applying data mining techniques to an enterprise, providing trained human resources, and developing enterprise procedures and systems for data mining and knowledge discovery.

 

Experience gained from consulting will be documented in the “Know-How” knowledge base and added to the target knowledge base.

 

Solutions

 

The Hub can offer special presentations of its current “Know-how” knowledge base. These presentations will describe the alternative solutions available and when and how to apply them; how and where to get started; why failure to implement is so common, and why pitfalls are so avoidable; case studies that reveal the rewards of proper design and implementation; why establishing an internal predictive modeling practice is within one’s reach; tips, tricks and techniques for data preparation and method selection; live participant polls and an interactive guru session with the experts; and resources and direction on how to move forward with confidence.

 

Experience gained from developed solutions will be documented in the “Know-How” knowledge base and added to the target knowledge base.

Experience

 

 

The accumulated experience of the Hub personnel will be documented in the “Know-How” knowledge base and added to the target knowledge base.

 

Business Analytics

 

 

 

Experience gained from business cases will be documented in the “Know-How” knowledge base and added to the target knowledge base.

Business Intelligence

 

 

 

 

Experience gained from consulting and running business intelligence projects will be documented in the “Know-How” knowledge base and added to the target knowledge base.

Meetings

 

 

 

 

Experience gained from meetings will be documented in the “Know-How” knowledge base and added to the target knowledge base.

Conferences

 

 

 

 

 

Data Repositories

 

Examples of existing Data repositories:

·            KDD Cup center, with all data, tasks, and results.

·            UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.

·            UCI Machine Learning Repository.

·            AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.

·            Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.

·            Data.gov.uk, publicly available data from UK.

·            DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.

·            Delve, Data for Evaluating Learning in Valid Experiments

·            Enron Email Dataset, data from about 150 users, mostly senior management of Enron.

·            FEDSTATS, a comprehensive source of US statistics and more

·            FIMI repository for frequent itemset mining, implementations and datasets.

·            Financial Data Finder at OSU, a large catalog of financial data sets

·            GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query and retrieval.

·            Grain Market Research, financial data including stocks, futures, etc.

·            ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.

·            Infobiotics PSP (protein structure prediction) datasets, adjustable real-world family of benchmarks for testing the scalability of classification/regression methods.

·            Investor Links, includes financial data

·            Microsoft's TerraServer, aerial photographs and satellite images you can view and purchase.

·            MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.

·            NASDAQ Data Store, provides access to market data.

·            National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.

·            National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.

·            PubGene(TM) Gene Database and Tools, genomic-related publications database

·            SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.

·            SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.

·            STATOO Datasets part 1 and STATOO Datasets part 2

·            UCR Time Series Classification/Clustering page, offering datasets, papers, links, and code.

·            United States Census Bureau.

 

 

 

 

Forums

 

 

Experience gained from forum discussions will be documented in the “Know-How” knowledge base and added to the target knowledge base.

 

FAQs

 

 

 

 

 

 

CFP

 

Links to the most recent “Calls for Papers” in the field.

 

 

 

 

International Conference

 

 

 

 

 

Partners & Collaborations  

 

Current possibilities:

 

OSU-SAS

SAS

IBM

 

 

Students Activities

 

Students can work as:

-Research assistants

-help in organizing Free Webinars

-help in Training

-help in Business Analytics

-help in Business Intelligence

-help in organizing Meetings

-help in organizing Conferences

-help in maintaining Data Repositories

-help in moderating Forums

-help in seeking Partners & Collaborations

-help in maintaining the Arabic Digital Library

-help in moderating Membership of Social Networks

-help in maintaining the Membership of the Hub

-help in establishing Links to other Universities offering degrees in “Data Mining”

-help in providing and broadcasting Hub News

 

 

Member of Social Networks

 

In order to have more exposure, the Hub will become a member of several current social networks such as:

Follow us on Twitter, Facebook, and YouTube.

 

Become a Member of the Hub

 

Membership Goals:

·            Building a database of professionals in “Data Mining”

·            Encouraging information exchange among members

·            Encouraging researchers to publish in “Data Mining” field.

·            Supporting Arabic content in Data Mining.

·            Providing knowledge in Data Mining.

·            Establishing a relationship between professionals and stakeholders.

Membership Benefits:

·            Having the latest news and events supported by the Hub

·            The possibility of participating in research and consulting projects.

·            The possibility of getting a special rate for training carried out by the Hub.

·            The possibility of funding a member’s research.

·            The possibility of funding a member’s publications.

·            Receiving news and awareness information.

·            Receiving updated information on Data Mining conferences.

·            The possibility of access to the Hub’s Arabic Digital Library.

·            The possibility of discounts from Hub sponsors.

·            Receiving messages related to events.

Membership Conditions:

·            Providing true and accurate information.

·            Participating in publishing articles and research on the Hub’s website.

·            Supporting other Hub members.

·            Presenting the Hub’s work to professionals, and working hard to attract research opportunities to the Hub.

 

Links to Data Mining Research Groups

 

 

 

 

Links to Other Universities Offering “Data Mining” Degrees

 

 

Hub News

 

FeedBurner makes it easy to receive content updates in My Yahoo!, NewsGator, Bloglines, and other news readers. It works with web-based news readers such as:

My Yahoo!, NewsGator, Netvibes, Google, Plusmo, The Free Dictionary, Bitty Browser, NewsAlloy, Live.com, Excite MIX, Attensa for Outlook, Webwag, Podcast Ready, Flurry, Wikio, and Daily Rotation.

 

 

Start Up Projects at PSU Data Mining Hub

 

--You can download a PPT that describes the projects below

 

 

The start-up projects for the PSU Data Mining Hub are:

 

 1-Medical Data mining & Knowledge Discovery- “Constructing Rule-Based Knowledge Bases Using Ant Colony Optimization”:

Medical services in the Kingdom have improved in several leaps in recent years. Medium to large hospitals have now established several kinds of modern information systems to keep track of their operations, with computerized records of patients, operations, materials, tests, procedures, medications, facilities, etc. With the heavy traffic that goes through such hospitals, they are overwhelmed by the volume of data. They need guidance and results to make effective use of records that are data-rich, yet information-poor. Much can be revealed from the data that hospitals accumulate over the years. In this project we propose a self-organizing Ant Colony Optimization (ACO) technique, inspired by the behavior of ants as social insects that work together to accomplish a common goal using the wisdom of the crowd. ACO is one of the algorithms that put swarm intelligence into action. Swarm intelligence, which is based on the idea of collective behavior, has led to ACO being used in various fields and problem-solving domains. Data mining is one of the domains where ACO has been applied successfully and has provided scalable solutions. In this project, we describe a knowledge discovery classification technique based on ACO. AntMiner is a rule induction algorithm that employs collective intelligence to construct classification rules. Experimental results are shown as AntMiner+ is implemented with different variations inspired by discrete optimization, fuzzy rule induction, self-organizing maps (SOM), dimensionality reduction, and parallel simultaneous rule learning, and tested on different datasets. Moreover, further combinations of those variations that produced enhancements are also proposed and tested.
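
To make the idea of ACO-based rule induction concrete, the following is a much simplified sketch in the spirit of Ant-Miner. It is not the AntMiner+ implementation proposed in this project. Assumptions made for the example: rule quality is sensitivity multiplied by specificity, a simple class-purity heuristic stands in for the entropy-based heuristic usually described for Ant-Miner, and the patient records, attribute names and parameter values are made up.

```python
import random


def covers(rule, row):
    """A rule is a dict {attribute: value}; it covers rows matching every term."""
    return all(row[attr] == value for attr, value in rule.items())


def quality(rule, rows, target):
    """Sensitivity x specificity of the rule for the target class."""
    tp = fp = fn = tn = 0
    for row in rows:
        covered, positive = covers(rule, row), row["class"] == target
        if covered and positive:
            tp += 1
        elif covered:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        return 0.0
    return (tp / (tp + fn)) * (tn / (fp + tn))


def heuristic(term, rows, target):
    """Laplace-smoothed share of target-class rows among rows with this term."""
    attr, value = term
    matching = [row for row in rows if row[attr] == value]
    hits = sum(1 for row in matching if row["class"] == target)
    return (hits + 1) / (len(matching) + 2)


def build_rule(terms, pheromone, rows, target, rng, max_terms):
    """One ant adds terms, chosen with probability ~ pheromone x heuristic."""
    rule = {}
    while len(rule) < max_terms:
        candidates = [t for t in terms if t[0] not in rule]
        if not candidates:
            break
        weights = [pheromone[t] * heuristic(t, rows, target) for t in candidates]
        attr, value = rng.choices(candidates, weights=weights, k=1)[0]
        rule[attr] = value
    return rule


def prune(rule, rows, target):
    """Drop terms one at a time while doing so does not lower the quality."""
    rule = dict(rule)
    shrunk = True
    while shrunk and len(rule) > 1:
        shrunk = False
        base = quality(rule, rows, target)
        for attr in list(rule):
            shorter = {a: v for a, v in rule.items() if a != attr}
            if quality(shorter, rows, target) >= base:
                rule, shrunk = shorter, True
                break
    return rule


def ant_miner(rows, target, n_ants=30, max_terms=2, seed=0):
    rng = random.Random(seed)
    attrs = [a for a in rows[0] if a != "class"]
    terms = sorted({(a, row[a]) for row in rows for a in attrs})
    pheromone = {t: 1.0 / len(terms) for t in terms}
    best_rule, best_q = {}, -1.0
    for _ in range(n_ants):
        rule = prune(build_rule(terms, pheromone, rows, target, rng, max_terms),
                     rows, target)
        q = quality(rule, rows, target)
        if q > best_q:
            best_rule, best_q = rule, q
        for term in rule.items():          # reinforce the terms that were used
            pheromone[term] += pheromone[term] * q
        total = sum(pheromone.values())    # normalising acts as evaporation
        for t in pheromone:
            pheromone[t] /= total
    return best_rule, best_q


# Made-up toy records, for illustration only.
records = [
    {"fever": "high", "cough": "yes", "class": "flu"},
    {"fever": "high", "cough": "yes", "class": "flu"},
    {"fever": "high", "cough": "no", "class": "flu"},
    {"fever": "none", "cough": "yes", "class": "cold"},
    {"fever": "none", "cough": "no", "class": "healthy"},
    {"fever": "none", "cough": "no", "class": "healthy"},
]
best_rule, best_q = ant_miner(records, target="flu")
print("IF", best_rule, "THEN flu; quality =", best_q)
```

Each ant builds one candidate rule. Terms that appear in good rules receive more pheromone and are therefore more likely to be chosen by later ants, which is how the colony's collective experience steers the search.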

 

Current proposal draft

 

     --- Ahmed-Kamal file (This file I will keep updating each week ISA)

 

 

2-Satellite Data Mining:

This is an Image Mining research proposal.

This project proposes a novel approach for monitoring “hot points” on the ground in Saudi Arabia. Hot points are locations such as borders with neighbours, places most frequently visited by orbiting spy satellites, oil fields, and religious places. These points should be tightly secured for Saudi national security. This work will provide results from mining satellite images of these points over any specified period of time, to be used by decision makers in defence and oil exploration.

Currently there are three Saudi satellites (Saudisat 1A, 1B, and 1C), also known as Saudi-OSCAR 41, 42, and 50. They are owned by King Abdulaziz City for Science and Technology (KACST). We need to see how to use them wisely in this project. We first need to know what they are currently used for and what kind of research is done with them. A good deal of information about them is available on the Web. We need to establish some form of linkage with researchers at this institute.

 

---Faisal-Ali file (This file I will keep updating each week ISA)

 

 

Current proposal draft

 

3- Development of Arabic Text Mining Software Tools:

 

---Omar-file (This file I will keep updating each week ISA)

 

Software tools for English Text Mining:

http://www.kdnuggets.com/software/text.html

 

 

Objective:

To adapt and modify selected English Text Mining tools (from the above web site) in order to produce equivalent Arabic versions. The cross-validation method requires a very accurate English/Arabic translator that will provide input data for the algorithm/program conversion.

 

 

English/Arabic Translators:

They vary in their accuracy. Some sites:

http://translation.babylon.com/english/to-arabic/

 

Arabic Natural Language Processing:

 

Michigan Tutorial

Tutorial on Text Mining

 Arabic Understanding

Arabic Text Mining paper

Arabic Text Mining open source 

 

 

Methodology:

The second objective is to improve the quantity and quality of Arabic content on “Data and Text Mining” on the Web. All published material from the Hub’s activities will be translated, reviewed by its author(s), and made available in an Arabic Digital Library. A systematic plan will be developed to translate many “data mining” articles and store them in a searchable Arabic Digital Library. Text and multimedia mining tools will be used to explore the contents of this Arabic digital library and expose related and correlated paragraphs and sections, for the purpose of developing new Arabic text mining algorithms and enhancing existing ones. This brings in the other area of focus of the Hub, which is unstructured text mining.

 

As for unstructured text mining: in parallel with the Arabic digital library, an English Data Mining digital library (with the same contents) will also be developed. Both libraries will offer a traditional search engine alongside more elaborate classification and categorization capabilities. In addition, text and multimedia mining tools will be used to explore the contents of the two digital libraries and expose related and correlated paragraphs and sections. Text mining is used to find interesting regularities in large textual digital libraries, where “interesting” means non-trivial, hidden, previously unknown, and potentially useful. Both Arabic and English text mining tools handle digital library text at the word level, sentence level, document level, document-collection level, linked-document-collection level, and application level. Most text mining methods rely on the fact that documents usually contain highly redundant data. Most of the tools make use of document summarization techniques, single-document graph visualization algorithms, segmentation algorithms, feature selection algorithms, similarity algorithms, clustering, and information extraction techniques.

They also make use of several visualization techniques such as: WebSOM, ThemeScape, Graph-Based visualization techniques, and Tiling-based visualization techniques.
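
As a small illustration of word-level handling for both languages, the sketch below normalizes and tokenizes mixed Arabic and English text. It is a simplification under stated assumptions, not one of the tools listed in this proposal: the normalization covers only diacritics, the tatweel, and the hamza/madda forms of alef, and the function names are hypothetical.

```python
import re

# Arabic diacritics (tashkeel, U+064B to U+0652) and the tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below, all mapped to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text):
    """Strip Arabic diacritics, unify alef forms, and lower-case Latin text."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return text.lower()


def tokenize(text):
    """Split normalized text into runs of letters (Arabic or Latin)."""
    return re.findall(r"[^\W\d_]+", normalize(text))


print(tokenize("Text Mining"))
print(tokenize("تنقيب النصوص"))
```

Stripping diacritics before tokenizing matters because a fully vocalized Arabic word would otherwise be broken apart at each diacritic mark, and the vocalized and unvocalized spellings of the same word would be counted as different terms.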

 

Statistical tools for text mining include: YALE/RapidMiner word vector mining, UIMA by IBM, GATE, the AeroText suite, Attensity, Endeca Technologies, Inxight, and LanguageWare.

 

As with “Data Mining”, we also propose the same vertical stacking of text mining, statistical, and visualization algorithms for performing text mining on both the English and the Arabic data mining digital libraries. This will provide an interesting context for researchers in the “Text mining” and “Arabization” fields to investigate how to improve Arabic text mining algorithms, using the English ones as a cross reference. A very interesting research direction can be developed there. For example, the same mining questions can be posed to both the English and the Arabic digital libraries and the results compared. Where the results differ, learning opportunities arise, and modifications and enhancements to the algorithms are to be investigated. The two libraries will provide several ways and means for verification, validation, and cross checking.

 

Arabicl-Template-PSU Research Proposal.docx

Current Proposal Draft

 

 

4- Know-how Knowledge-based System:

 

Current Proposal Draft

 

 

 

 


 

Extra Appendix:

 

 

Learning Outcomes: Certificate or Diploma in Data Mining

Outcome

Courses

Be able to approach data mining as a process, by demonstrating competency in the use of CRISP-DM, the Cross-Industry Standard Process for Data Mining, including the business understanding phase, the data understanding phase, the exploratory data analysis phase, the modeling phase, the evaluation phase, and the deployment phase.

Be proficient with leading data mining software, including WEKA, Clementine by SPSS, and the R language.

Understand and apply a wide range of clustering, estimation, prediction, and classification algorithms, including k-means clustering, BIRCH clustering, Kohonen clustering, classification and regression trees, the C4.5 algorithm, logistic regression, k-nearest neighbor, multiple regression, and neural networks.

Understand and apply the most current data mining techniques and applications, such as text mining, mining genomics data, and other current issues.

Understand the mathematical statistics foundations of the algorithms outlined above.

 

Course Notes:

 

-             Course PDF slides

-             Course ppt slides

-             Main Resources Web Site 

 

below

Assignments, Mid-term Quiz, and Final Exam

Datasets

Assign-1-solution

Assign-2-solution

Assign-3-solution

Assign-4-solution

Assign-5-solution

 

Grade Distribution:

Assignments     15%

Term Project    30%

Midterm         15%

Final           40%

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
