
Introduction
In a recent project, we needed to process and manage a vast amount of data stored in AWS S3.
Our challenge? Handling over 7,000 company IDs and more than 100,000 files while ensuring efficient retrieval, filtering, and management of the latest relevant files.
Challenges We Faced
- Large Dataset: Scanning and processing such a high volume of files required an optimized approach.
- Efficient Filtering: Identifying and retrieving only the latest 10 files per company from S3.
- Performance Optimization: Minimizing API calls and handling pagination efficiently to avoid unnecessary overhead.
- Storage Management: Keeping local storage clean by removing outdated files before downloading new ones.
Our Solution
Using Node.js and the AWS SDK, we developed an automated pipeline (sketched end to end right after this list) to:
- Read company IDs from a CSV file.
- List and filter S3 objects based on company IDs.
- Maintain a structured mapping to store only the latest 10 files per company.
- Download the required files while ensuring local cleanup of outdated files.
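Before diving into each step, here is a minimal sketch of how these pieces can be wired together. The run method is a hypothetical driver added for illustration; listBuckets, manageFolder, and getAWSServiceInstances are the functions shown in the steps below.
// Hypothetical driver method; listBuckets, manageFolder, and getAWSServiceInstances
// are the functions shown in the implementation steps that follow.
async run(accessKeyId: string, secretAccessKey: string, region: string, csvFilePath: string) {
  // Steps 1 & 2: read company IDs from the CSV and build the "latest 10 files" map from S3
  const companyFilesMap = await this.listBuckets(accessKeyId, secretAccessKey, region, csvFilePath);

  // Steps 3 & 4: clean each company's local folder and download its selected files
  const { s3 } = getAWSServiceInstances(accessKeyId, secretAccessKey, region);
  for (const [companyId, filesMap] of Object.entries(companyFilesMap)) {
    await this.manageFolder(s3, companyId, filesMap);
  }
}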
Key Implementation Steps
1. Reading Company IDs from a CSV File
// Requires: import fs from "fs"; and import csv from "csv-parser";
async readCSV(filePath: string): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const companyIds: string[] = [];
    fs.createReadStream(filePath)
      .pipe(csv())
      .on("data", (row: any) => {
        // Each row is keyed by the CSV header; collect the COMPANY_ID column
        if (row.COMPANY_ID) companyIds.push(row.COMPANY_ID.trim());
      })
      .on("end", () => resolve(companyIds))
      .on("error", reject);
  });
}
We started by reading a CSV file containing 7,000+ company IDs to filter relevant files from S3.
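For context, the CSV only needs a COMPANY_ID header column; the file name and sample rows below are illustrative.
// companies.csv (illustrative contents)
// COMPANY_ID,NAME
// 1001,Acme Corp
// 1002,Globex

const companyIds = await this.readCSV("./companies.csv");
console.log(`Loaded ${companyIds.length} company IDs`);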
2. Efficient S3 File Listing & Filtering
async listBuckets(accessKeyId: string, secretAccessKey: string, region: string, csvFilePath: string) {
  const companyIds = await this.readCSV(csvFilePath);
  const companyIdSet = new Set(companyIds); // O(1) lookups instead of scanning 7,000+ IDs per object
  const bucketName = "bucket-name";
  const folderPrefix = "folder-prefix";
  await configureAWSCredentials(accessKeyId, secretAccessKey, region);
  const services = getAWSServiceInstances(accessKeyId, secretAccessKey, region);
  const s3 = services.s3;

  let continuationToken: string | undefined;
  const companyFilesMap: Record<string, { pdf: any[]; tif: any[] }> = {};

  do {
    // S3 returns at most 1,000 keys per call, so paginate with the continuation token
    const objectsResponse = await s3.listObjectsV2({
      Bucket: bucketName,
      Prefix: folderPrefix,
      MaxKeys: 1000,
      ContinuationToken: continuationToken,
    }).promise();

    if (!objectsResponse.Contents) break;

    for (const obj of objectsResponse.Contents) {
      const fileName = (obj.Key || "").split("/").pop() || "";
      // File names look like "Company_<id>_..."; extract the company ID
      const match = fileName.match(/Company_(\d+)_/);
      if (!match) continue;

      const companyId = match[1];
      if (!companyIdSet.has(companyId)) continue;

      let fileType: "pdf" | "tif" | null = null;
      if (fileName.endsWith("_original_pdf")) fileType = "pdf";
      if (fileName.endsWith("_original_tif")) fileType = "tif";
      if (!fileType) continue;

      if (!companyFilesMap[companyId]) {
        companyFilesMap[companyId] = { pdf: [], tif: [] };
      }

      companyFilesMap[companyId][fileType].push({
        Key: obj.Key,
        LastModified: new Date(obj.LastModified),
        Size: obj.Size,
      });

      // Keep only the 10 most recent files per company and type while streaming through pages
      companyFilesMap[companyId][fileType].sort((a, b) => b.LastModified.getTime() - a.LastModified.getTime());
      companyFilesMap[companyId][fileType] = companyFilesMap[companyId][fileType].slice(0, 10);
    }

    continuationToken = objectsResponse.NextContinuationToken;
  } while (continuationToken);

  return companyFilesMap;
}
This method efficiently filters 100,000+ files while retaining only the latest 10 files per company and file type.
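The returned companyFilesMap looks roughly like the following; the keys, dates, and sizes shown here are illustrative, not real data.
// Illustrative shape of the map returned by listBuckets()
{
  "1001": {
    pdf: [
      { Key: "folder-prefix/Company_1001_..._original_pdf", LastModified: new Date("2024-01-15"), Size: 204800 },
      // ...up to 10 entries, newest first
    ],
    tif: [
      { Key: "folder-prefix/Company_1001_..._original_tif", LastModified: new Date("2024-01-14"), Size: 512000 },
    ],
  },
  // ...one entry per company ID that matched files in S3
}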
3. Managing Local File Storage
Before downloading new files, we clear any previously stored files to avoid redundant data.
async manageFolder(s3: any, companyId: string, filesMap: { pdf: any[]; tif: any[] }) {
  for (const fileType of ["pdf", "tif"] as ("pdf" | "tif")[]) {
    const downloadDir = path.join(__dirname, "downloads", companyId, fileType);

    // Clear previously downloaded files, or create the directory if it does not exist yet
    if (fs.existsSync(downloadDir)) {
      fs.readdirSync(downloadDir).forEach((file) => {
        fs.unlinkSync(path.join(downloadDir, file));
      });
    } else {
      fs.mkdirSync(downloadDir, { recursive: true });
    }

    // Download the (at most 10) files for this company and type in parallel
    await Promise.all(
      filesMap[fileType].map(async (file) => {
        try {
          await this.downloadFile(s3, file.Key, companyId, fileType);
        } catch (err) {
          console.error(`Error downloading file: ${file.Key}`, err);
        }
      })
    );
  }
}
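One possible refinement, not part of the original implementation: the synchronous readdirSync/unlinkSync calls block the event loop, so the cleanup could instead use fs.promises (available since Node 14.14), for example:
// Sketch of a non-blocking cleanup, as an alternative to the sync calls above
import { promises as fsp } from "fs";

async function resetDir(downloadDir: string) {
  // Remove the directory and its contents if present, then recreate it
  await fsp.rm(downloadDir, { recursive: true, force: true });
  await fsp.mkdir(downloadDir, { recursive: true });
}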
4. Downloading Files from S3
async downloadFile(s3: any, key: string, companyId: string, fileType: "pdf" | "tif") {
  const fileName = key.split("/").pop() || "unknown";
  const downloadDir = path.join(__dirname, "downloads", companyId, fileType);
  const filePath = path.join(downloadDir, fileName);

  // Skip files that already exist locally to avoid redundant downloads
  if (fs.existsSync(filePath)) {
    console.log(`Skipping existing file: ${filePath}`);
    return;
  }

  console.log(`Downloading: ${fileName} -> ${filePath}`);
  const fileStream = fs.createWriteStream(filePath);
  const s3Stream = s3.getObject({ Bucket: "bucket-name", Key: key }).createReadStream();

  // Resolve once the local write finishes; reject on either read or write errors
  return new Promise((resolve, reject) => {
    s3Stream
      .on("error", reject)
      .pipe(fileStream)
      .on("error", reject)
      .on("finish", () => resolve(filePath));
  });
}
This approach ensures that only necessary files are stored while preventing redundant downloads.
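If transient S3 or network errors become an issue at this volume, the download call could be wrapped in a small retry helper. The sketch below is a generic utility added for illustration, not part of the original code.
// Hypothetical retry wrapper with simple linear backoff
async function withRetries<T>(fn: () => Promise<T>, attempts = 3, delayMs = 1000): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait a little longer before each retry
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
    }
  }
  throw lastError;
}

// Usage: await withRetries(() => this.downloadFile(s3, file.Key, companyId, fileType));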
Key Takeaways
- Efficient file retrieval: The script pages through 100,000+ S3 objects while keeping only a small per-company subset in memory.
- Optimized storage management: Ensuring only relevant files are stored locally, reducing clutter.
- Automated cleanup: Old files are removed before downloading new ones.
- Scalability: The approach can be extended to larger datasets by improving concurrency and adding logging (see the sketch after this list).
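As a sketch of the concurrency point above, companies could be processed in fixed-size batches instead of strictly one at a time. The helper and batch size below are illustrative, not part of the original code.
// Illustrative batching helper: run the worker for a fixed number of companies at a time
async function processInBatches(
  entries: [string, { pdf: any[]; tif: any[] }][],
  worker: (companyId: string, filesMap: { pdf: any[]; tif: any[] }) => Promise<void>,
  batchSize = 20
) {
  for (let i = 0; i < entries.length; i += batchSize) {
    const batch = entries.slice(i, i + batchSize);
    await Promise.all(batch.map(([companyId, filesMap]) => worker(companyId, filesMap)));
  }
}

// Usage: await processInBatches(Object.entries(companyFilesMap), (id, map) => this.manageFolder(s3, id, map));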
Conclusion
Managing AWS S3 files efficiently at scale requires structured automation, optimized API calls, and intelligent filtering. By leveraging Node.js and the AWS SDK, we streamlined this process, ensuring that only relevant files are stored and downloaded while keeping local storage organized.