Data Science Portfolio
I have 3+ years of IT experience and 1+ year of data science experience. I have worked on multi-class classification algorithms (SVM, Decision Tree, Random Forest, and XGBoost) for handwritten digit recognition, on raisin class prediction with KNN and Logistic Regression, and on concrete strength prediction with Linear Regression. I also have extensive experience in web scraping and web automation. A motivated, detail-oriented team player with a strong work ethic and a passion for innovation, I am looking for opportunities to leverage my skills and experience to drive impactful projects and achieve business objectives.
The concrete slump test measures the consistency of fresh concrete before it sets. It is performed to check the workability of freshly made concrete, and therefore the ease with which concrete flows. It can also be used as an indicator of an improperly mixed batch.
The data set consists of various cement properties and the resulting slump test metrics in cm. The set concrete is then tested for its compressive strength 28 days later.
The Ridge model seems better according to the R-squared and mean absolute error (MAE) scores.
According to the feature importance coefficients, Cement, Fly ash, Coarse aggr., and Water are the most important and representative features for the model.
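Reading Ridge coefficients as feature importances only works when the features are on a comparable scale, so the model is usually fit on standardized inputs. A minimal sketch of that idea (the data below are synthetic stand-ins, not the actual concrete dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four features (e.g. Cement, Fly ash, Coarse aggr., Water)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Standardize first so the Ridge coefficients are comparable across features
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

coefs = model.named_steps["ridge"].coef_
print(coefs)  # larger |coef| -> more influential feature on this scale
```

Here the first feature dominates by construction; on the real data the same comparison of absolute coefficient values ranks the features.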
Go to the GitHub page of Logistic Regression...
Go to the GitHub page of KNN...
Images of the Kecimen and Besni raisin varieties grown in Turkey were obtained with a computer vision system (CVS). A total of 900 raisin grains were used, 450 from each variety. These images were subjected to various pre-processing stages, and 7 morphological features were extracted. These features were then classified using three different artificial intelligence techniques.
https://archive.ics.uci.edu/ml/datasets/Raisin+Dataset
The Area, MajorAxisLength, ConvexArea, and Perimeter features appear to be the most useful for distinguishing the two classes.
Since the data contain marginal values, two scaling methods (Standard Scaler and Robust Scaler) were applied and compared. No significant difference was found between them, yet the Robust Scaler results are slightly better on the confusion matrix, since the Robust Scaler is more resistant to outliers.
* Standard Scaled Model ---> ROC-AUC = 0.92 ---> 23 false predictions
* Robust Scaled Model ---> ROC-AUC = 0.92 ---> 22 false predictions
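The scaler comparison above can be sketched as a pair of pipelines scored side by side. This is a minimal illustration on synthetic data (the real project uses the 7 raisin features), not the project's exact code:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in: 900 samples, 7 features, like the raisin dataset's shape
X, y = make_classification(n_samples=900, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for scaler in (StandardScaler(), RobustScaler()):
    pipe = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
    print(type(scaler).__name__, "ROC-AUC:", round(auc, 3))
```

RobustScaler centers on the median and scales by the IQR, which is why it is less affected by the marginal values mentioned above.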
Red dots represent test data, while blue ones represent train data. The smallest gap between the train and test error rates corresponds to k=13. However, smaller k values could also be a good option if there is a cost concern, so smaller k values were also applied to compare the results.
* KNN_19 (grid) Model ---> accuracy: 0.83 ---> roc_auc: 0.92 ---> 30 false predictions
* KNN_13 (elbow) Model ---> accuracy: 0.87 ---> roc_auc: 0.91 ---> 29 false predictions
* KNN_5 (elbow) Model ---> accuracy: 0.86 ---> roc_auc: 0.89 ---> 26 false predictions
* Two of the scores for the KNN_5 model are better than those of the grid model.
* KNN_5 would be the more cost-efficient option to choose. However, if cost is not an issue, KNN_13 is also a good option.
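The elbow search described above scans k values and tracks the train/test error gap. A minimal sketch on synthetic data (not the project's actual dataset or plotting code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same rough shape as the raisin data
X, y = make_classification(n_samples=900, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for k in range(1, 20, 2):  # odd k avoids ties in binary voting
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_err = 1 - knn.score(X_tr, y_tr)
    test_err = 1 - knn.score(X_te, y_te)
    gaps[k] = abs(test_err - train_err)  # the gap plotted in the elbow chart

best_k = min(gaps, key=gaps.get)
print("k with smallest train/test gap:", best_k)
```

Plotting `gaps` against k reproduces the red/blue error-rate chart; the elbow is where the gap stops shrinking meaningfully.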
Logistic Regression scores were better than the KNN scores.
* KNN_13 (elbow) Model ---> accuracy: 0.87 ---> roc_auc: 0.91 ---> 28 false predictions
* Robust Scaled Model ---> ROC-AUC = 0.92 ---> 22 false predictions
Go to the GitHub page of SVM...
Go to the GitHub page of Decision Tree...
Go to the GitHub page of Random Forest...
Go to the GitHub page of XGBoost...
This digit database was created by collecting 250 samples from each of 44 writers. A WACOM PL-100V pressure-sensitive tablet with an integrated LCD display and a cordless stylus was used, so the input and display areas are located in the same place. Attached to the serial port of an Intel 486-based PC, it allows handwriting samples to be collected. The tablet sends $x$ and $y$ tablet coordinates and pressure-level values of the pen at fixed time intervals (sampling rate) of 100 milliseconds.
The writers were asked to write 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. Subjects were monitored only during the first entry screens. Each screen contains five boxes with the digits to be written displayed above them. Subjects were told to write only inside these boxes. If they made a mistake or were unhappy with their writing, they were instructed to clear the content of a box using an on-screen button. The first ten digits are ignored because most writers are not familiar with this type of input device, but subjects are not made aware of this.
In this study, the researchers use only the $(x, y)$ coordinate information; the stylus pressure-level values are ignored. The raw data captured from the tablet consist of integer values between 0 and 500 (tablet input box resolution). The coordinates are then rescaled so that the coordinate with the maximum range varies between 0 and 100. Usually $x$ stays in this range, since most characters are taller than they are wide.
Attribute information:
In order to train and test the classifiers, digits must be represented as constant-length feature vectors. A commonly used technique that leads to good results is resampling the $(x_t, y_t)$ points. Either temporal resampling (points regularly spaced in time) or spatial resampling (points regularly spaced in arc length) can be used here. The raw point data are already regularly spaced in time, but the distance between them is variable.
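Spatial resampling can be sketched as linear interpolation along the stroke's cumulative arc length. This is an illustration of the idea, not the dataset authors' exact preprocessing code:

```python
import numpy as np

def spatial_resample(points, n=8):
    """Resample a stroke to n points equally spaced in arc length.

    `points` is an (m, 2) sequence of raw (x, y) tablet coordinates.
    """
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)  # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative arc length
    targets = np.linspace(0.0, s[-1], n)                   # equally spaced stations
    x = np.interp(targets, s, points[:, 0])
    y = np.interp(targets, s, points[:, 1])
    return np.stack([x, y], axis=1)

# A simple L-shaped stroke reduced to a constant-length feature vector
stroke = [(0, 0), (0, 100), (100, 100)]
features = spatial_resample(stroke, n=8).ravel()
print(features.shape)  # (16,): 8 resampled points * 2 coordinates
```

Whatever the original sampling rate produced, every digit ends up as the same fixed-length vector, which is exactly what the classifiers require.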
- Data reference link :
https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
The data (input2, input4, and input13) contain outliers. All rows containing outliers will be dropped. However, both the cleaned and uncleaned datasets will be used for the models in order to compare the results.
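One common way to drop such rows is the 1.5-IQR rule applied per column. A minimal sketch on synthetic data (the column names match the text, but the values and the exact rule used in the project are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the three columns flagged as having outliers
rng = np.random.default_rng(0)
df = pd.DataFrame({"input2": rng.normal(size=200),
                   "input4": rng.normal(size=200),
                   "input13": rng.normal(size=200)})
df.loc[0, "input2"] = 15.0  # plant an obvious outlier

# Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every flagged column
mask = pd.Series(True, index=df.index)
for col in ["input2", "input4", "input13"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

cleaned = df[mask]
print(len(df), "->", len(cleaned))
```

Keeping both `df` and `cleaned` makes the with/without-outliers model comparison described above straightforward.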
Comparison between cleaned and uncleaned data results
* The cleaned data (with outliers dropped) perform slightly better than the uncleaned data.
* Outlier cleaning seemed to help remove noise from the data.
* The current results are good enough, but for better results, further cleaning could be done by considering the outliers of each class individually.
Comparison between cleaned, uncleaned, bagging results
* Decision Tree test results increased slightly while train results decreased with the cleaned data. It seems that overfitting decreased slightly after dropping the outliers.
* The cleaned data were also used with the bagging method for comparison. The results were better for both the test and training scores. Overall, overfitting was eliminated and test scores improved with the cleaned data and bagging.
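The bagging comparison above can be sketched by scoring a single tree against a bagged ensemble of trees. Here scikit-learn's bundled digits set stands in for the pen-based dataset; this is an illustration, not the project's exact setup:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# sklearn's digits dataset as a stand-in for the pen-based digit data
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Single tree vs. 100 bootstrap-aggregated trees
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                        n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree test accuracy:", round(tree.score(X_te, y_te), 3))
print("bagged trees test accuracy:", round(bag.score(X_te, y_te), 3))
```

Averaging many trees trained on bootstrap samples reduces variance, which is why bagging narrows the train/test gap noted above.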
* All four models have pretty good scores.
* SVM takes first place in terms of scores.
* However, in a real-world scenario, the score is not the only thing to consider.
* For example, if an explainable model is needed, the Decision Tree is the only option.
* In this context, which algorithm to choose for deployment depends on the business problem we plan to solve.
The goal was to gain insights into the income distribution of US citizens and to explore potential relationships between income and other variables such as race, sex, education, and age.
The data were cleaned and explored, and univariate/multivariate analysis and outlier detection were performed. Trends, patterns, and interesting insights were identified about the income distribution and its relationship with other variables. Data visualization was performed using graphs, charts, and plots to present the findings in a clear and understandable way.
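The univariate/multivariate steps described above can be sketched with a couple of pandas operations. The column names and values below are hypothetical placeholders, not the project's actual census fields:

```python
import pandas as pd

# Tiny hypothetical sample of income records (not the real survey data)
df = pd.DataFrame({
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "education": ["Bachelors", "HS-grad", "Masters",
                  "HS-grad", "Bachelors", "Masters"],
    "age": [39, 50, 38, 53, 28, 45],
})

# Univariate: overall class balance of the income variable
print(df["income"].value_counts(normalize=True))

# Multivariate: share of high earners within each education level
rate = (df.assign(high=df["income"].eq(">50K"))
          .groupby("education")["high"].mean())
print(rate)
```

The same pattern (a boolean flag plus `groupby(...).mean()`) extends to race, sex, or age bins for the other relationships mentioned.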
Data was downloaded from the website of the Turkish Statistical Institute. Subsequently, data cleaning and preparation were performed using Python (Pandas library). The information was then visualized in interactive dashboards to showcase regional, city-based, and year-based differences. Below, you can access and explore the Tableau dashboards interactively.
Go to the GitHub page of Drinking Water Quality Assessment Project...
Go to the GitHub page of Automation Project...
The project's aim was to produce final Excel sheets presenting cleaned raw chemical monitoring data and the calculated drinking water quality classes for the Drinking Water Treatment Plants of Turkey. The raw data were scraped from a web-based database using Selenium. The data were then sorted by chemical name, date, etc., and assigned to the relevant quality classes according to the legislation requirements, using Python, Pandas, and SQL. The project automated the process 100% for four colleagues and saved each user 3-4 months of hard work.
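The class-assignment step after scraping can be sketched as a lookup of each measurement against per-parameter limits. The parameter names, limits, and class labels below are placeholders, not the actual legislation values:

```python
import pandas as pd

# Hypothetical scraped monitoring records (parameters and values are made up)
records = pd.DataFrame({
    "parameter": ["Nitrate", "Nitrate", "Arsenic"],
    "date": pd.to_datetime(["2021-03-01", "2021-06-01", "2021-03-01"]),
    "value": [12.0, 60.0, 0.004],
})

# Hypothetical class limits per parameter: value <= limit -> that class
limits = {"Nitrate": [(25.0, "A1"), (50.0, "A2")],
          "Arsenic": [(0.01, "A1"), (0.05, "A2")]}

def quality_class(row):
    for limit, label in limits.get(row["parameter"], []):
        if row["value"] <= limit:
            return label
    return "A3"  # exceeds all listed limits

# Sort as described (by chemical name and date), then assign classes
records = records.sort_values(["parameter", "date"])
records["class"] = records.apply(quality_class, axis=1)
print(records)
```

With the classes assigned, writing the final sheets is a single `DataFrame.to_excel` call per plant.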
The program automates official letter traffic, including preparing response letters and attachments, downloading/uploading files from web-based and cloud platforms, and keeping an organized file system. It has processed nearly 5000 official letters, automating 90-95% of the work, using Python and Selenium.