
Introduction
In a recent project, we needed to process and manage a vast amount of data stored in AWS S3.
Our challenge? Handling over 7,000 company IDs and more than 100,000 files while ensuring efficient retrieval, filtering, and management of the latest relevant files.
Challenges We Faced
- Large Dataset: Scanning and processing such a high volume of files required an optimized approach.
- Efficient Filtering: Identifying and retrieving only the latest 10 files per company from S3.
- Performance Optimization: Minimizing API calls and handling pagination efficiently to avoid unnecessary overhead.
- Storage Management: Keeping local storage clean by removing outdated files before downloading new ones.
Our Solution
Using Node.js and the AWS SDK, we developed an automated pipeline (sketched end to end right after this list) to:
- Read company IDs from a CSV file.
- List and filter S3 objects based on company IDs.
- Maintain a structured mapping to store only the latest 10 files per company.
- Download the required files while ensuring local cleanup of outdated files.
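Before diving into each step, here is a minimal sketch of how these pieces can be wired together. The run method is a hypothetical driver added for illustration; listBuckets, manageFolder, and getAWSServiceInstances are the functions shown in the steps below.
// Hypothetical driver method; listBuckets, manageFolder, and getAWSServiceInstances
// are the functions shown in the implementation steps that follow.
async run(accessKeyId: string, secretAccessKey: string, region: string, csvFilePath: string) {
  // Steps 1 & 2: read company IDs from the CSV and build the "latest 10 files" map from S3
  const companyFilesMap = await this.listBuckets(accessKeyId, secretAccessKey, region, csvFilePath);

  // Steps 3 & 4: clean each company's local folder and download its selected files
  const { s3 } = getAWSServiceInstances(accessKeyId, secretAccessKey, region);
  for (const [companyId, filesMap] of Object.entries(companyFilesMap)) {
    await this.manageFolder(s3, companyId, filesMap);
  }
}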
Key Implementation Steps
1. Reading Company IDs from a CSV File
// Requires: import fs from "fs"; and import csv from "csv-parser";
async readCSV(filePath: string): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const companyIds: string[] = [];
    fs.createReadStream(filePath)
      .pipe(csv())
      .on("data", (row: any) => {
        // Each row is keyed by the CSV header; collect the COMPANY_ID column
        if (row.COMPANY_ID) companyIds.push(row.COMPANY_ID.trim());
      })
      .on("end", () => resolve(companyIds))
      .on("error", reject);
  });
}
We started by reading a CSV file containing 7,000+ company IDs to filter relevant files from S3.
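For context, the CSV only needs a COMPANY_ID header column; the file name and sample rows below are illustrative.
// companies.csv (illustrative contents)
// COMPANY_ID,NAME
// 1001,Acme Corp
// 1002,Globex

const companyIds = await this.readCSV("./companies.csv");
console.log(`Loaded ${companyIds.length} company IDs`);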
2. Efficient S3 File Listing & Filtering
async listBuckets(accessKeyId: string, secretAccessKey: string, region: string, csvFilePath: string) {
  const companyIds = await this.readCSV(csvFilePath);
  const companyIdSet = new Set(companyIds); // O(1) lookups instead of scanning 7,000+ IDs per object
  const bucketName = "bucket-name";
  const folderPrefix = "folder-prefix";
  await configureAWSCredentials(accessKeyId, secretAccessKey, region);
  const services = getAWSServiceInstances(accessKeyId, secretAccessKey, region);
  const s3 = services.s3;

  let continuationToken: string | undefined;
  const companyFilesMap: Record<string, { pdf: any[]; tif: any[] }> = {};

  do {
    // S3 returns at most 1,000 keys per call, so paginate with the continuation token
    const objectsResponse = await s3.listObjectsV2({
      Bucket: bucketName,
      Prefix: folderPrefix,
      MaxKeys: 1000,
      ContinuationToken: continuationToken,
    }).promise();

    if (!objectsResponse.Contents) break;

    for (const obj of objectsResponse.Contents) {
      const fileName = (obj.Key || "").split("/").pop() || "";
      // File names look like "Company_<id>_..."; extract the company ID
      const match = fileName.match(/Company_(\d+)_/);
      if (!match) continue;

      const companyId = match[1];
      if (!companyIdSet.has(companyId)) continue;

      let fileType: "pdf" | "tif" | null = null;
      if (fileName.endsWith("_original_pdf")) fileType = "pdf";
      if (fileName.endsWith("_original_tif")) fileType = "tif";
      if (!fileType) continue;

      if (!companyFilesMap[companyId]) {
        companyFilesMap[companyId] = { pdf: [], tif: [] };
      }

      companyFilesMap[companyId][fileType].push({
        Key: obj.Key,
        LastModified: new Date(obj.LastModified),
        Size: obj.Size,
      });

      // Keep only the 10 most recent files per company and type while streaming through pages
      companyFilesMap[companyId][fileType].sort((a, b) => b.LastModified.getTime() - a.LastModified.getTime());
      companyFilesMap[companyId][fileType] = companyFilesMap[companyId][fileType].slice(0, 10);
    }

    continuationToken = objectsResponse.NextContinuationToken;
  } while (continuationToken);

  return companyFilesMap;
}
This method efficiently filters 100,000+ files while retaining only the latest 10 files per company and file type.
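The returned companyFilesMap looks roughly like the following; the keys, dates, and sizes shown here are illustrative, not real data.
// Illustrative shape of the map returned by listBuckets()
{
  "1001": {
    pdf: [
      { Key: "folder-prefix/Company_1001_..._original_pdf", LastModified: new Date("2024-01-15"), Size: 204800 },
      // ...up to 10 entries, newest first
    ],
    tif: [
      { Key: "folder-prefix/Company_1001_..._original_tif", LastModified: new Date("2024-01-14"), Size: 512000 },
    ],
  },
  // ...one entry per company ID that matched files in S3
}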
3. Managing Local File Storage
Before downloading new files, we clear any previously stored files to avoid redundant data.
async manageFolder(s3: any, companyId: string, filesMap: { pdf: any[]; tif: any[] }) {
  for (const fileType of ["pdf", "tif"] as ("pdf" | "tif")[]) {
    const downloadDir = path.join(__dirname, "downloads", companyId, fileType);

    // Clear previously downloaded files, or create the directory if it does not exist yet
    if (fs.existsSync(downloadDir)) {
      fs.readdirSync(downloadDir).forEach((file) => {
        fs.unlinkSync(path.join(downloadDir, file));
      });
    } else {
      fs.mkdirSync(downloadDir, { recursive: true });
    }

    // Download the (at most 10) files for this company and type in parallel
    await Promise.all(
      filesMap[fileType].map(async (file) => {
        try {
          await this.downloadFile(s3, file.Key, companyId, fileType);
        } catch (err) {
          console.error(`Error downloading file: ${file.Key}`, err);
        }
      })
    );
  }
}
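One possible refinement, not part of the original implementation: the synchronous readdirSync/unlinkSync calls block the event loop, so the cleanup could instead use fs.promises (available since Node 14.14), for example:
// Sketch of a non-blocking cleanup, as an alternative to the sync calls above
import { promises as fsp } from "fs";

async function resetDir(downloadDir: string) {
  // Remove the directory and its contents if present, then recreate it
  await fsp.rm(downloadDir, { recursive: true, force: true });
  await fsp.mkdir(downloadDir, { recursive: true });
}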
4. Downloading Files from S3
async downloadFile(s3: any, key: string, companyId: string, fileType: "pdf" | "tif") {
  const fileName = key.split("/").pop() || "unknown";
  const downloadDir = path.join(__dirname, "downloads", companyId, fileType);
  const filePath = path.join(downloadDir, fileName);

  // Skip files that already exist locally to avoid redundant downloads
  if (fs.existsSync(filePath)) {
    console.log(`Skipping existing file: ${filePath}`);
    return;
  }

  console.log(`Downloading: ${fileName} -> ${filePath}`);
  const fileStream = fs.createWriteStream(filePath);
  const s3Stream = s3.getObject({ Bucket: "bucket-name", Key: key }).createReadStream();

  // Resolve once the local write finishes; reject on either read or write errors
  return new Promise((resolve, reject) => {
    s3Stream
      .on("error", reject)
      .pipe(fileStream)
      .on("error", reject)
      .on("finish", () => resolve(filePath));
  });
}
This approach ensures that only necessary files are stored while preventing redundant downloads.
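If transient S3 or network errors become an issue at this volume, the download call could be wrapped in a small retry helper. The sketch below is a generic utility added for illustration, not part of the original code.
// Hypothetical retry wrapper with simple linear backoff
async function withRetries<T>(fn: () => Promise<T>, attempts = 3, delayMs = 1000): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait a little longer before each retry
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
    }
  }
  throw lastError;
}

// Usage: await withRetries(() => this.downloadFile(s3, file.Key, companyId, fileType));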
Key Takeaways
- Efficient file retrieval: The script pages through 100,000+ S3 objects while keeping only a small per-company subset in memory.
- Optimized storage management: Ensuring only relevant files are stored locally, reducing clutter.
- Automated cleanup: Old files are removed before downloading new ones.
- Scalability: The approach can be extended to larger datasets by improving concurrency and adding logging (see the sketch after this list).
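As a sketch of the concurrency point above, companies could be processed in fixed-size batches instead of strictly one at a time. The helper and batch size below are illustrative, not part of the original code.
// Illustrative batching helper: run the worker for a fixed number of companies at a time
async function processInBatches(
  entries: [string, { pdf: any[]; tif: any[] }][],
  worker: (companyId: string, filesMap: { pdf: any[]; tif: any[] }) => Promise<void>,
  batchSize = 20
) {
  for (let i = 0; i < entries.length; i += batchSize) {
    const batch = entries.slice(i, i + batchSize);
    await Promise.all(batch.map(([companyId, filesMap]) => worker(companyId, filesMap)));
  }
}

// Usage: await processInBatches(Object.entries(companyFilesMap), (id, map) => this.manageFolder(s3, id, map));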
Conclusion
Managing AWS S3 files efficiently at scale requires structured automation, optimized API calls, and intelligent filtering. By leveraging Node.js and the AWS SDK, we streamlined this process, ensuring that only relevant files are stored and downloaded while keeping local storage organized.