19 Commits

Author SHA1 Message Date
IshaAtteri
a435592f75 Merge branch 'main' of https://github.com/IshaAtteri/datamining_881 into isha 2026-03-12 12:41:15 -04:00
IshaAtteri
437492e623 small changes 2026-03-12 12:16:51 -04:00
prabhaavp
525e359c6b - HTML -> TSV 2026-03-12 12:14:31 -04:00
IshaAtteri
a1beba6730 BeautifulSoup extraction code 2026-03-12 12:11:37 -04:00
prabhaavp
1614d85270 - Fixed bug: certain characters can't be used in folder names. A sanitize_slug function now removes those characters. 2026-03-10 14:45:45 -04:00
prabhaavp
cfbddf2a24 - Name each folder after its Wikipedia slug. Fix needed: certain characters can't be used in folder names and must be removed. 2026-03-10 14:15:33 -04:00
IshaAtteri
8fa2cdba3c preprocessing script 2026-03-10 14:14:59 -04:00
Vadella, Anna
2ec6f8c28a testing extract_wiki_zim.py 2026-03-10 13:29:56 -04:00
prabhaavp
36af063777 - Delete a movie's folder if the movie was skipped because it was not found 2026-03-10 13:17:21 -04:00
prabhaavp
0ac1234afa - Fix directories 2026-03-10 13:10:25 -04:00
prabhaavp
401e7e5497 - Extract info needed from ZIM file 2026-02-12 20:07:09 -05:00
IshaAtteri
9412c834f1 Merge pull request #2 from IshaAtteri/isha
structure change
2026-02-11 17:56:24 -05:00
IshaAtteri
cb2fcd19eb structure change 2026-02-11 17:55:24 -05:00
IshaAtteri
ed2e20f8cd Merge pull request #1 from IshaAtteri/isha has the code
Isha
2026-02-11 17:54:04 -05:00
IshaAtteri
0cc571727b Wikipedia movie scraping code using the API 2026-02-11 17:51:38 -05:00
IshaAtteri
30dbfe0dcc code for job system stuff 2026-02-11 17:40:59 -05:00
prabhaavp
369f5ced89 Update README.md
Updated readme to include structure picture
2026-02-03 22:25:28 -05:00
prabhaavp
2d2ee64c0e - Added venv instruction + requirements.txt
- Added data folder structure with .gitkeep
- Added .gitignore
- Added load.py to load IMDB dataset and preview with D-Tale
2026-02-03 22:21:41 -05:00
IshaAtteri
c18b412867 Initial commit 2026-01-27 12:39:22 -05:00