# CSE 881: Data Mining - Course Project on Gitea

## Setup

Note: If you're using an IDE such as PyCharm, steps 1-2 may be done automatically.

1. Create a virtual environment:
```
python -m venv venv
```

2. Activate the virtual environment:
```
source venv/bin/activate
```

3. Install dependencies:
```
pip install -r requirements.txt
```

4. Transfer datasets - The raw data is too large to store on GitHub. Download the dataset from:
```
https://developer.imdb.com/non-commercial-datasets/
```
Decompress the downloaded `.tsv.gz` archives and place the resulting `.tsv` files in:
```
data/raw/imdb_dataset
```
Files to place:
```
name.basics.tsv
title.akas.tsv
title.basics.tsv
title.crew.tsv
title.episode.tsv
title.principals.tsv
title.ratings.tsv
```
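
Before running the pipeline, it can help to confirm that the files listed above are where this README expects them. The script below is only a sketch of such a check. The function name `missing_dataset_files` is hypothetical and not part of the project; the directory and file names are taken from step 4.

```python
from pathlib import Path

# File names listed in step 4 of this README.
EXPECTED_FILES = [
    "name.basics.tsv",
    "title.akas.tsv",
    "title.basics.tsv",
    "title.crew.tsv",
    "title.episode.tsv",
    "title.principals.tsv",
    "title.ratings.tsv",
]


def missing_dataset_files(raw_dir="data/raw/imdb_dataset"):
    """Return the expected file names that are not present in raw_dir.

    Hypothetical helper for illustration; it only checks that each file
    exists, not that its contents are complete or valid.
    """
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All expected dataset files are in place.")
```

Run it from the repository root so that the relative path `data/raw/imdb_dataset` resolves correctly.
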
5. Run the pipeline in order:


6.

## Citations
https://developer.imdb.com/non-commercial-datasets/

<img width="498" height="507" alt="image" src="https://github.com/user-attachments/assets/bbda6c5e-85ba-4f49-8778-916960bba302" />