# CSE 881: Data Mining - Course Project

## Setup

Note: If you're using an IDE like PyCharm, steps 1-2 may be done automatically.

1. Create a virtual environment:
```
python -m venv venv
```

2. Activate the virtual environment (on Windows, run `venv\Scripts\activate` instead):
```
source venv/bin/activate
```

3. Install dependencies:
```
pip install -r requirements.txt
```

4. Transfer datasets: the raw data is too large to include in the GitHub repository. Download the datasets from:
```
https://developer.imdb.com/non-commercial-datasets/
```

Unzip the downloaded archives and place the resulting `.tsv` files in:
```
data/raw/imdb_dataset
```
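If the downloads arrive as gzip-compressed `.tsv.gz` archives, the unzip-and-place step can be scripted. The following is a minimal sketch using only the Python standard library. The `extract_archives` helper and the `downloads` source folder are hypothetical names chosen for illustration and are not part of this project. Run it from the repository root so the default destination resolves correctly.

```python
import gzip
import shutil
from pathlib import Path


def extract_archives(src_dir, dest_dir="data/raw/imdb_dataset"):
    """Decompress every *.tsv.gz in src_dir into dest_dir and return the written paths.

    Hypothetical helper: the name and defaults are assumptions, not project code.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(src.glob("*.tsv.gz")):
        # Drop the ".gz" suffix: "title.basics.tsv.gz" becomes "title.basics.tsv".
        target = dest / archive.name[: -len(".gz")]
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            # Stream the data so large files are never held fully in memory.
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written


# Example call ("downloads" is an assumed folder holding the archives):
# extract_archives("downloads")
```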
Files to place:
```
name.basics.tsv
title.akas.tsv
title.basics.tsv
title.crew.tsv
title.episode.tsv
title.principals.tsv
title.ratings.tsv
```
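To confirm the files landed in the right place before running anything else, a quick check along these lines may help. This is a sketch under the assumption that it is run from the repository root. The `missing_files` helper is a hypothetical name, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the "files to place" list in this README.
EXPECTED_FILES = [
    "name.basics.tsv",
    "title.akas.tsv",
    "title.basics.tsv",
    "title.crew.tsv",
    "title.episode.tsv",
    "title.principals.tsv",
    "title.ratings.tsv",
]


def missing_files(data_dir="data/raw/imdb_dataset"):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected IMDb files are in place.")
```

An empty list from `missing_files()` means every expected `.tsv` file was found.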
## Citations

IMDb Non-Commercial Datasets: https://developer.imdb.com/non-commercial-datasets/