This is my Module 4 project for EAI6010: Applications of Artificial Intelligence.
I used the UCI SECOM dataset to study a semiconductor pass/fail prediction problem with sensor data. The main goal was not just to run one model. I wanted to adapt an off-the-shelf tutorial idea into a cleaner workflow that fit this dataset better.
In this final version, I used a leakage-safe train/validation/test setup, compared multiple models, and evaluated results with metrics that make more sense for an imbalanced pass/fail problem.
- Type: Individual course project
- Course: EAI6010: Applications of Artificial Intelligence
- Module: Module 4 - Using an Off-the-shelf Model
- Focus: Semiconductor quality-monitoring style classification
This project is framed as a semiconductor quality-monitoring problem.
Given sensor measurements from a manufacturing process, can I identify whether a sample is more likely to pass or fail? In practice, a model like this could be more useful as an early screening tool than as a fully automated final decision system.
My goal was to take an existing tutorial idea and rebuild it for a different dataset and a more realistic tabular workflow.
I wanted to:
- use a semiconductor-related dataset
- handle missing values more carefully
- avoid data leakage
- compare strong baselines instead of depending on only one model
- use evaluation metrics that fit an imbalanced classification problem
- think honestly about deployment limits
This project uses the UCI SECOM dataset.
Main files in this repo:
- data/secom.data - sensor features
- data/secom_labels.data - labels and timestamp-related fields
- data/secom.names - dataset metadata from UCI
Dataset summary (as reported in the notebook):
- Rows: 1567
- Sensor features: 590
- Target distribution: 1463 pass / 104 fail
More details are in data/README.md.
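For reference, a minimal sketch of parsing files in this whitespace-delimited, header-less format. The inline strings are tiny mock stand-ins for data/secom.data and data/secom_labels.data (the real label file quotes its timestamp column, so parsing there may need a small adjustment), and the -1/1 to 0/1 label mapping follows the UCI documentation (-1 = pass, 1 = fail):

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for data/secom.data and data/secom_labels.data.
# Both UCI files are whitespace-delimited with no header row.
mock_features = io.StringIO("3030.93 2564.00 NaN\n3095.78 2465.14 0.50\n")
mock_labels = io.StringIO("-1 19/07/2008 11:55:00\n1 19/07/2008 12:32:00\n")

X = pd.read_csv(mock_features, sep=r"\s+", header=None)
labels = pd.read_csv(mock_labels, sep=r"\s+", header=None,
                     names=["label", "date", "time"])

# In SECOM, -1 marks a pass and 1 marks a fail; map to 0/1 for modeling.
y = labels["label"].map({-1: 0, 1: 1})
```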
- Python
- Jupyter Notebook
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- PyTorch
- data loading
- missing-value audit
- timestamp-based exploratory analysis
- train / validation / test split
- training-only missingness filtering
- median imputation
- constant-feature removal
- scaling and PCA
- model comparison
- threshold tuning
- final holdout test evaluation
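The leakage-safe preprocessing steps above can be sketched as a scikit-learn Pipeline fit on the training split only. This is a minimal illustration on synthetic data; the shapes, missingness rate, and component count are placeholders, not the project's actual settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the sensor matrix: one constant column and
# scattered missing values, loosely mimicking SECOM's quirks.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X[:, 0] = 1.0
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Every statistic (medians, variances, means, PCA axes) is learned on the
# training split only, then reused on validation/test -- no leakage.
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("drop_const", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10, random_state=0)),
])
Z_train = prep.fit_transform(X_train)
Z_test = prep.transform(X_test)
```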
- Dummy Classifier
- Logistic Regression + PCA
- Random Forest
- Class-weighted PyTorch MLP
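A minimal sketch of the comparison step, using synthetic data in place of the preprocessed SECOM features and validation-split balanced accuracy as the selection criterion. The model settings are illustrative, not the notebook's exact configuration, and the PyTorch MLP is omitted for brevity:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic imbalanced data standing in for the preprocessed features:
# positives are rare and driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.5).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(class_weight="balanced", random_state=0),
}
# Fit on train, score on validation; the best validation score picks the
# model that goes on to the untouched holdout test set.
scores = {name: balanced_accuracy_score(y_val, m.fit(X_tr, y_tr).predict(X_val))
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

On imbalanced data the majority-class dummy baseline lands at a balanced accuracy of exactly 0.5, which makes it a useful floor for the other models.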
On the validation split, Random Forest had the strongest balanced accuracy among the compared models, so I selected it for final test evaluation.
- Accuracy: 0.7898
- Balanced accuracy: 0.6663
- Recall: 0.5238
- ROC-AUC: 0.7978
- PR-AUC: 0.2192
The model correctly detected 11 of the 21 fail cases on the test set, but it also produced many false positives.
Because of that, I see this as a stronger analytical prototype or screening model, not something I would deploy directly as a final automated pass/fail decision system.
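Threshold tuning of the kind used in the workflow can be sketched as follows. The validation labels and scores here are synthetic stand-ins for the selected model's predicted fail probabilities, and target_recall is an illustrative choice, not a project value:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins: y_val would be the true fail labels and `scores`
# the model's predicted fail probabilities on the validation split.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=300)
scores = 0.3 * y_val + 0.35 + 0.25 * rng.random(300)

prec, rec, thr = precision_recall_curve(y_val, scores)

# Pick the highest threshold that still reaches a target fail recall;
# the matching precision then shows the false-positive cost of that choice.
target_recall = 0.5
ok = rec[:-1] >= target_recall   # thr aligns with prec[:-1] / rec[:-1]
chosen = thr[ok][-1] if ok.any() else thr[0]
```

In practice the target itself should come out of the cost discussion with engineering stakeholders rather than a fixed number.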
- Final notebook: notebooks/EAI6010_Module_4_Assignment_V2_Cheng_L.ipynb
- Final assignment report: reports/EAI6010_Module_4_Assignment_V2_Cheng_L.pdf
- Portfolio PDF version: reports/EAI6010_SECOM_Portfolio_Cheng_Liu.pdf
- Walkthrough version: walkthrough/project-walkthrough.md
- Dataset note: data/README.md
- Figure note: outputs/README.md
- Clone or download this repository.
- Install the required Python packages:
pip install -r requirements.txt
- Open the main notebook:
notebooks/EAI6010_Module_4_Assignment_V2_Cheng_L.ipynb
- Run the notebook cells in order.
This notebook is the main runnable artifact in the repo. It first checks for the SECOM files in data/ or ../data/, then falls back to /content/secom_project and can re-download the UCI files if needed.
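The lookup order described above could be sketched like this; find_secom_dir is a hypothetical helper name, not a function from the notebook:

```python
from pathlib import Path

def find_secom_dir(candidates=("data", "../data", "/content/secom_project")):
    """Return the first candidate directory that contains secom.data.

    Mirrors the notebook's lookup order; the Colab path is the last local
    resort before re-downloading the UCI files.
    """
    for cand in candidates:
        path = Path(cand)
        if (path / "secom.data").is_file():
            return path
    return None  # caller can fall back to downloading from UCI
```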
This project shows that I can:
- work with messy tabular sensor data
- handle missing values and imbalanced classes more carefully
- build a leakage-safe workflow
- compare simple and nonlinear models
- evaluate results beyond plain accuracy
- explain model limits honestly instead of overselling the result
This is still not a production-ready manufacturing model.
Main limits:
- small minority class
- many missing values
- high-dimensional noisy features
- possible process shift over time
- split-specific model selection
- decision threshold still needs a cost discussion with business or engineering stakeholders
For a more deployment-oriented version, I would still want:
- repeated validation or cross-validation
- time-based validation
- drift monitoring
- threshold setting with domain stakeholders
- more engineering validation for important sensor signals
This is an individual course project.
I selected the dataset, rebuilt the workflow, ran the analysis, compared the models, interpreted the results, and wrote the final report and portfolio version.
- UCI Machine Learning Repository. SECOM Dataset.
- Vinit. Semiconductor/IOT-MachineLearning. Kaggle.
- Howard, J., & Gugger, S. Deep Learning for Coders with Fastai and PyTorch. O'Reilly.