
Extracting Data from PDFs / Unstructured Text (Small Scale)

Posted: Thu May 29, 2025 6:26 am
by sakib40

Data Augmentation (Expanding a Tiny Dataset)
How to do it: If you have a tiny dataset, you can "augment" it to create more examples. For images, apply rotations, flips, and brightness changes. For text, try synonym replacement or simple paraphrasing.
Time commitment: 2-6 hours (learning augmentation techniques, applying code).
Tools: Python (TensorFlow/Keras, PyTorch, Albumentations for images; NLTK for text).
Why it's free & fast: You're reusing what you have, maximizing its value.
Example: Taking 20 photos of local fruits from Mohadevpur and augmenting them to create 200 variations for a fruit classification model.
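The geometric augmentations above can be sketched with nothing but NumPy: four 90-degree rotations, each with and without a horizontal flip, already turn one image into eight. This is a minimal illustration, not a full pipeline; a real project would add brightness and colour jitter with a library like Albumentations. The `augment_image` function name and the dummy 4x4 array are just placeholders for this sketch.

```python
import numpy as np

def augment_image(img):
    """Return simple geometric variations of an image array.

    Four 90-degree rotations, each also mirrored horizontally,
    give 8 variations per input image.
    """
    variations = []
    for k in range(4):                         # 0, 90, 180, 270 degrees
        rotated = np.rot90(img, k)
        variations.append(rotated)
        variations.append(np.fliplr(rotated))  # mirrored copy
    return variations

# Dummy 4x4 grayscale "photo" standing in for a real image
photo = np.arange(16).reshape(4, 4)
augmented = augment_image(photo)
print(len(augmented))  # 8 variations from a single image
```

Applied to 20 fruit photos with a few brightness levels on top, this style of loop is how you get from 20 originals to 200 training examples.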
How to do it: For very small, specific datasets, manually copy relevant information from a few publicly available PDFs or unstructured text documents into a spreadsheet. At a slightly larger scale, explore Python libraries like PyPDF2 (text extraction) or Tabula-py (tables in PDFs), or use regular expressions for pattern extraction.
Time commitment: 5-10 hours (identifying source, manual extraction/scripting, cleaning).
Tools: PDF reader, text editor, Python (PyPDF2, Tabula-py, regex).
Ethical considerations: Ensure the PDF/text is publicly available and does not contain sensitive or copyrighted information that restricts your use.
Example: Extracting public company names and their declared industries from a few annual reports (if publicly available) for a business categorization project.
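A regex pass over already-extracted text is often the quickest step here. The sketch below assumes the PDF text has already been pulled out (e.g. via PyPDF2's `extract_text()`); the report snippet, the "Industry:" label, and the company names are invented for illustration, so you would adapt the pattern to whatever labels your actual documents use.

```python
import re

# Hypothetical snippet of text already extracted from an annual report PDF
raw_text = """
Acme Holdings Ltd. Industry: Textiles.
Borak Foods PLC reported strong growth. Industry: Food Processing.
"""

# Capture whatever follows "Industry:" up to the next period or newline
pattern = re.compile(r"Industry:\s*([^.\n]+)")
industries = [m.strip() for m in pattern.findall(raw_text)]
print(industries)  # ['Textiles', 'Food Processing']
```

Even a crude pattern like this gets a usable first column into your spreadsheet, which you can then clean by hand, fitting the "usable in 24 hours" spirit of this approach.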
Remember, the goal in 24 hours is usability, not perfection. Embrace the agile approach: collect, clean a little, model, learn, and then iterate. You'll be amazed at what you can achieve when you focus on resourceful, ethical, and practical data collection within a tight timeframe!