I think the speed problem is due to the difference between logical locations and physical locations. When you logically mount your Google Drive in Google Colab, the physical location of the files is very much not on Google Colab.
I tried to find some code I wrote to deal with this, but I couldn't find it.
Caveat: I dealt with the problems described below approximately 12 months ago, so there is a small chance that some things have changed.
My perspective: I'm not a programmer, but I can code in Python. I was a sys/net-admin, teacher, MCSE, "webmaster"--prior to 2005.
Because I cannot cite documentation of my claim, I will describe my problem and solution as proof of my claim. If you believe my claim, you can probably skim or skip this section.
My problem: I had up to 80 GB of (WAV) files that were physically in up to six different Google Drive accounts. With my sym-fu skills, I could effectively mount all six Google Drives at the same time in one Colab session. Obviously, Colab did not transfer that data to the physical server on which my Colab session was running.
Let's say I had a Python command to concatenate 30 files into one new file: `newFile.wav = concat(listPathFilenames)`. Those 30 files were physically spread across six different Google Drives. The Python interpreter would request the files from the OS (Colab), and the OS would use filesystem-level operations to move the physical files to the Colab server. Just waiting for 600 MB of files to transfer could take 30 seconds, but the operation itself would only take 2-5 seconds. (I wasn't really concatenating, you know?)
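If you want to see that gap for yourself, a rough sketch like the following should show it (the path is made up; point it at any large file on your own mounted Drive). The first read waits for the physical transfer; the second read is served from the Colab server.

```python
import time
from pathlib import Path

# Hypothetical path: substitute any large file on your mounted Google Drive.
pathFilename = Path('/content/drive/MyDrive/audio/example.wav')

start = time.perf_counter()
pathFilename.read_bytes()   # first read: includes the physical transfer from Google Drive
print(f"first read:  {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
pathFilename.read_bytes()   # second read: the bytes are already on the Colab server
print(f"second read: {time.perf_counter() - start:.1f} s")
```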
So, at least for a little while, my solution was to "operate" on the files before I needed to operate on the files. My flow allowed me to easily predict which files I would soon need, so I had logic that would do something like this:
```python
for pathFilename in listPathFilenames:
    pathlib.Path(pathFilename).stat()
```
I had to try out a few different functions to find the right one. I didn't want to modify the file, and some functions wouldn't force the physical transfer of the file: for example, I think `.exists()` didn't work. The net effect was that the physical location of the files would be on the Colab server, and when I did the real operations on the files, there would not be a delay as the files were retrieved from Google Drive.
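Wrapped up as a function, the trick looked roughly like the sketch below. The file list and paths are placeholders, and whether `.stat()` still forces the transfer on today's Colab is something you would need to re-test; kicking it off in a background thread just lets the transfer overlap with whatever you are working on now.

```python
import pathlib
import threading

def prefetchFiles(listPathFilenames: list[str]) -> None:
    # Touch each file's metadata so Colab starts pulling it from Google Drive
    # before the real operation needs it.
    for pathFilename in listPathFilenames:
        pathlib.Path(pathFilename).stat()

# Hypothetical usage: warm up the next batch while the current batch is processing.
listPathFilenamesNext = [
    '/content/drive/MyDrive/audio/take01.wav',
    '/content/drive/MyDrive/audio/take02.wav',
]
threading.Thread(target=prefetchFiles, args=(listPathFilenamesNext,), daemon=True).start()
```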
First, I don't have enough knowledge of `pip` to understand the answer from https://stackoverflow.com/users/14208544/hor-golzari, so I would still incorporate his guidance. (Well, I mean, since you seem to understand it, you should use his knowledge.)
From what I can tell, Colab uses multiple excellent tactics to speed up on-the-fly environment creation. Off the top of my head: the `git` command, to any destination, is prioritized at the network level.

In contrast, the filesystem-level transfers to and from Google Drive are absolutely not prioritized. One way I know that for sure: if you "write" a (large) file to Google Drive and the Colab environment says, "the file has been written," then even a catastrophic failure in your Colab will not prevent the file from reaching Google Drive. How? It's buffered. It's not fast--some files take 15 minutes before I can see them on Google Drive--but it is reliable.
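One practical consequence: if you need those buffered writes to actually land on Google Drive before the runtime goes away, the Colab drive module has a flush helper you can call explicitly. A minimal sketch (I have not measured how much it speeds up the appearance of files, so treat that as something to verify):

```python
from google.colab import drive

# Push any buffered writes out to Google Drive and release the mount.
# Worth calling before you deliberately shut down or recycle the runtime.
drive.flush_and_unmount()
```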
Therefore, I suspect Google Drive won't accomplish what you want, if only because Colab has decided not to prioritize the physical connection to Google Drive, which leaves it too slow to be useful.
> I'm trying to optimize my Google Colab workflow
I don't know what needs optimizing, but some things I've done (that I can recall off the top of my head) include installing packages in a way that means `pip` doesn't need to think.

The following used to be my template for quickly installing stuff. I still used `requirements.txt` files at the time. I've since switched to `pyproject.toml`, and I guess I would now use something like `pip install {repoTarget}@git+https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git` (see the sketch after the template below). idk.
```python
import sys
import subprocess
import pathlib

listPackages = ['Z0Z_tools']

def cloneRepo(repoTarget: str, repoOwner: str = 'hunterhogan') -> None:
    # Clone the repo once, queue its requirements.txt for the pip step below,
    # and make the clone importable in this session.
    # (userdata is imported in the Colab-only block below, which runs before cloneRepo is called.)
    if not pathlib.Path(repoTarget).exists():
        accessTokenGithub = userdata.get('github_token')
        subprocess.run(["git", "clone", f"https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git"], check=True)
    pathFilenameRequirements = pathlib.Path(repoTarget) / 'requirements.txt'
    if pathFilenameRequirements.exists():
        listPackages.append(f"-r {pathFilenameRequirements}")
    sys.path.append(repoTarget)

if 'google.colab' in sys.modules:
    from google.colab import drive, userdata
    drive.mount('/content/drive')
    cloneRepo('stubFileNotFound')
    cloneRepo('astToolFactory')

%pip install -q {' '.join(listPackages)}
```
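For the `pyproject.toml` version I mentioned above, the clone-plus-requirements dance could probably collapse into direct VCS installs. A sketch under that assumption, reusing the same repo names and `github_token` secret from the template (untested, so adjust to your own setup):

```python
from google.colab import userdata

accessTokenGithub = userdata.get('github_token')

# Each specifier tells pip to install the package straight from GitHub; the repo's
# pyproject.toml declares its dependencies, so no requirements.txt juggling is needed.
listSpecifiers = [
    f"{repoTarget}@git+https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git"
    for repoOwner, repoTarget in [('hunterhogan', 'stubFileNotFound'), ('hunterhogan', 'astToolFactory')]
]

%pip install -q {' '.join(listSpecifiers)}
```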