I am trying to stream large files from HTTP directly to S3.
I'd rather not download the file to disk first and then upload it; I want to do it in one pass.
The source is a big file (60 GB) served from an HTTP server, and the destination is an S3 bucket.
I have tested in two environments:
On my WSL environment, memory climbs to 100% and the script gets killed. Setting max_concurrency to 2 doesn't really help, so why do I still run out of memory?
On an EC2 micro instance, which is where I want to run the code, the boto3 call doesn't even run or show any error. Maybe I need to increase the instance's memory from 1 GB to 2-3 GB?
But I would still like to stay on the free tier...
Is there any way to stream such large files directly?
When I stream small files, 1 GB or less, it works without a problem.
I think the problem is memory related: the code tries to read the whole HTTP file into memory and then upload it. Maybe the way to do it is to read it in chunks and upload those chunks one at a time (see the sketch after my code below)?
How do I do that? I am not a Python expert and have been working on this for days. Here is my current code:
import boto3
from boto3.s3.transfer import TransferConfig

def stream_to_s3(self, source_filename, remote_filename):
    error = 0
    self.log(f"====> Streaming {source_filename} to S3://{remote_filename}")
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(self.params['UPLOAD_TO_S3']['S3_BUCKET'])
    destination = bucket.Object(remote_filename)
    # Stream the HTTP response instead of downloading it to disk first.
    with self.session.get(source_filename, stream=True) as response:
        GB = 1024 ** 3
        MB = 1024 * 1024
        max_threshold = 5 * GB
        # if int(response.headers['content-length']) > max_threshold:
        TC = TransferConfig(multipart_threshold=max_threshold, max_concurrency=2,
                            multipart_chunksize=8 * MB, use_threads=True)
        try:
            # Feed the raw (non-seekable) HTTP body straight into the managed upload.
            destination.upload_fileobj(response.raw, Config=TC)
        except Exception as e:
            self.log(f"====> Failure streaming file to S3://{remote_filename}. Reason: {e}")
            return 1
    self.log(f"====> Succeeded streaming file to S3://{remote_filename}")
    return error