# CSE 881: Data Mining - Course Project on Gitea

## Setup

Note: If you're using an IDE like PyCharm, steps 1-2 may be done automatically.

1. Create a virtual environment:
```
python -m venv venv
```

2. Activate the virtual environment:
```
# macOS/Linux (on Windows, run venv\Scripts\activate instead)
source venv/bin/activate
```

3. Install dependencies:
```
pip install -r requirements.txt
```

4. Transfer datasets: the raw data is too large to include in the repository. Download the dataset from:
```
https://developer.imdb.com/non-commercial-datasets/
```

Decompress the downloaded archives and place the extracted `.tsv` files in:
```
data/raw/imdb_dataset
```

Files to place:
```
name.basics.tsv
title.akas.tsv
title.basics.tsv
title.crew.tsv
title.episode.tsv
title.principals.tsv
title.ratings.tsv
```

5. Run the pipeline in order:
6.
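
Before running anything, it can help to confirm that step 4 was completed. The following is a minimal sketch, not part of the project itself: it assumes the downloads arrive as gzip archives named `<file>.tsv.gz`, and the function names `decompress_archives` and `missing_files` are hypothetical helpers introduced here for illustration. The directory and file names are the ones listed in step 4.

```python
import gzip
import shutil
from pathlib import Path

# Directory and file names come from step 4 of this README.
RAW_DIR = Path("data/raw/imdb_dataset")
EXPECTED_FILES = [
    "name.basics.tsv",
    "title.akas.tsv",
    "title.basics.tsv",
    "title.crew.tsv",
    "title.episode.tsv",
    "title.principals.tsv",
    "title.ratings.tsv",
]


def decompress_archives(raw_dir):
    """Decompress each *.tsv.gz in raw_dir that has no .tsv beside it yet.

    Assumption: the downloads are gzip archives named <file>.tsv.gz.
    Returns the names of the .tsv files that were created.
    """
    created = []
    for archive in sorted(raw_dir.glob("*.tsv.gz")):
        target = archive.with_suffix("")  # drops only the trailing ".gz"
        if target.exists():
            continue
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        created.append(target.name)
    return created


def missing_files(raw_dir):
    """Return the expected file names that are not present in raw_dir."""
    return [name for name in EXPECTED_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    for name in decompress_archives(RAW_DIR):
        print(f"decompressed {name}")
    missing = missing_files(RAW_DIR)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are in place.")
```

Saved under a name of your choice and run from the repository root with the virtual environment active, the script reports any file from the list above that is still missing. It uses only the standard library, so it does not depend on `requirements.txt`.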

## Citations:

https://developer.imdb.com/non-commercial-datasets/

<img width="498" height="507" alt="image" src="https://github.com/user-attachments/assets/bbda6c5e-85ba-4f49-8778-916960bba302" />