Many projects make use of the large dataset collected by the International Brain Laboratory.
This tutorial assumes that you’ve worked through the general Python setup, and that you are familiar with GitHub, conda and some basic command-line tools.
Use conda install to install pingouin: conda install -c conda-forge pingouin.
Use pip install nma-ibl to get access to the IBL data via DataJoint.
Check with dj.conn() that everything works. Then run dj.config.save_local() to create a local config file: you won’t have to enter your DataJoint credentials again in the future!
Note that this local file, dj_local_config.json, only lives inside the current folder, so it won’t be recognized if you launch Python from somewhere else.
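Because the local config file is tied to the folder you launch Python from, it can help to check for it before connecting. The following is a minimal sketch using only the standard library; the helper name has_local_config is hypothetical and not part of DataJoint, and the file name is taken from the text above.

```python
from pathlib import Path

# File name as given in the tutorial text
LOCAL_CONFIG_NAME = "dj_local_config.json"


def has_local_config(folder="."):
    """Return True if a DataJoint local config file exists in `folder`."""
    return (Path(folder) / LOCAL_CONFIG_NAME).is_file()


if __name__ == "__main__":
    if has_local_config():
        print("Found a local config in", Path.cwd())
    else:
        print("No local config in", Path.cwd(), "- expect to be asked for credentials")
```

If the check fails, either launch Python from the folder where you ran dj.config.save_local(), or run it again in the new folder.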
Now create a script that will load and save some data.
    # datajoint-specific stuff
    import datajoint as dj
    from nma_ibl import reference, subject, acquisition, behavior, behavior_analyses
Then, query some basic information about all sessions that were run.
    # which subjects (i.e. mice) are in the database?
    subjects = (subject.Subject * subject.SubjectLab * reference.Lab)

    # this contains a lot of information that we don't really need
    # (and will increase the size of the data we want to download).
    # so let's get only the columns that we're interested in
    subjects = subjects.proj('subject_nickname', 'sex', 'subject_birth_date', 'time_zone')

    # note that this is not yet data - it's only a query to the database.
    # fetch will actually get those data
    df_subjects = (
        subjects.fetch(format='frame')
        .sort_values(by=['lab_name', 'subject_nickname'])
        .reset_index()
    )

    # same for sessions - only take training sessions here
    sessions = (
        behavior.TrialSet
        * behavior_analyses.PsychResults
        * behavior_analyses.ReactionTime
        * behavior_analyses.SessionTrainingStatus
        * (acquisition.Session & 'task_protocol LIKE "%training%"')
        * acquisition.SessionUser
        & subjects
    )

    # only save some fields that we really care about for now
    # (otherwise, the dataframe will explode)
    sessions = sessions.proj(
        'n_trials', 'performance_easy', 'threshold', 'bias',
        'lapse_low', 'lapse_high', 'training_status', 'user_name',
        session_duration='TIMEDIFF(session_end_time,session_start_time)')
    df_sessions = sessions.fetch(format='frame').reset_index()

    # note: the two dataframes containing subject info and sessions info share
    # the column subject_uuid, which is called the 'primary key' that uniquely
    # identifies each mouse. use pandas' join to combine the two dataframes -
    # but beware the size of the data you're working with.
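The note about combining the two DataFrames on subject_uuid can be sketched as follows. This is only an illustration on made-up toy data (the values and the reduced set of columns are assumptions, not real IBL records), and it uses pandas' merge rather than join because merge matches on a shared column directly.

```python
import pandas as pd

# Toy stand-ins for df_subjects and df_sessions; all values are made up
df_subjects = pd.DataFrame({
    "subject_uuid": ["a1", "b2"],
    "subject_nickname": ["mouse_1", "mouse_2"],
    "lab_name": ["lab_x", "lab_y"],
})
df_sessions = pd.DataFrame({
    "subject_uuid": ["a1", "a1", "b2"],
    "n_trials": [250, 410, 380],
})

# One row per session; the subject columns are repeated for every session
# of that mouse. validate= raises an error if a subject_uuid appears more
# than once in df_subjects.
df_combined = df_sessions.merge(
    df_subjects, on="subject_uuid", how="left", validate="many_to_one"
)
print(df_combined.shape)  # (3, 4)
```

Since the subject columns are copied onto every session row, the combined table grows with the number of sessions, which is why it pays to keep only the columns you need.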
Now explore the DataFrame, for instance in ‘scientific mode’ in PyCharm or simply by printing different parts and groups to your command line. To better understand what the columns mean, see this list by Leon Hommerich as well as the official list of IBL dataset types (these do not map one-to-one onto the DataJoint names).
Exercise 1: write some code to save this newly created Pandas DataFrame as a csv file. Make sure that this (large) datafile does not get pushed to GitHub, for instance by creating a /data folder that is listed in your .gitignore. Then, for any analysis you want to run, load in this local file - you’ll now be able to get data without connecting to the DataJoint database. Of course, you may need different datafiles for different purposes (at the level of animals, sessions, or trials).
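One possible starting point for Exercise 1 is sketched below. The helper name save_locally, the file name sessions.csv and the toy DataFrame are hypothetical; the demo writes into a temporary folder so it can be run anywhere, whereas in your own project you would pass the project folder instead.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_locally(df, filename, project_dir="."):
    """Write df to <project_dir>/data/<filename> and list data/ in .gitignore."""
    project_dir = Path(project_dir)
    data_dir = project_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Add 'data/' to .gitignore unless it is already listed there
    gitignore = project_dir / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if "data/" not in lines:
        gitignore.write_text("\n".join(lines + ["data/"]) + "\n")

    out_path = data_dir / filename
    df.to_csv(out_path, index=False)
    return out_path


# Toy stand-in for df_sessions; the values are made up
df_sessions = pd.DataFrame({"subject_uuid": ["a1", "b2"], "n_trials": [250, 380]})

with tempfile.TemporaryDirectory() as project_dir:
    csv_path = save_locally(df_sessions, "sessions.csv", project_dir)
    # Later analyses can start from the local file instead of the database
    df_loaded = pd.read_csv(csv_path)
    print(df_loaded)
```

Note that a csv file does not store column types, so dates and durations come back as plain text unless you convert them after loading.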
Exercise 2: plot some basic information about all the sessions. When (at what time of day) were they collected? How many sessions were collected per lab, user, and animal? How does performance change as a function of each animal’s progression in training?
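Before plotting, it usually helps to compute the summaries you want to show. The sketch below does this on made-up toy data; the column names follow the query above, but the values are invented, and it assumes that session_start_time holds local times (if it does not, the time_zone column of the subjects table would be needed to convert).

```python
import pandas as pd

# Toy stand-in for df_sessions; all values are made up
df = pd.DataFrame({
    "subject_uuid": ["a1", "a1", "b2", "b2", "b2"],
    "user_name": ["u1", "u1", "u2", "u2", "u1"],
    "session_start_time": pd.to_datetime([
        "2020-01-06 09:15", "2020-01-07 09:40", "2020-01-06 14:05",
        "2020-01-07 14:30", "2020-01-08 10:00",
    ]),
    "performance_easy": [0.62, 0.71, 0.55, 0.80, 0.91],
})

# How many sessions did each user run?
sessions_per_user = df.groupby("user_name").size()

# At what hour of the day did sessions start?
sessions_per_hour = df["session_start_time"].dt.hour.value_counts().sort_index()

# Number the sessions within each animal (1 = first session), so that
# performance can be plotted against progression in training
df = df.sort_values(["subject_uuid", "session_start_time"])
df["session_number"] = df.groupby("subject_uuid").cumcount() + 1

print(sessions_per_user)
print(sessions_per_hour)
print(df[["subject_uuid", "session_number", "performance_easy"]])
```

Each of these tables can then be handed to the plotting library of your choice, for example as a bar plot of counts or a line plot of performance_easy against session_number per animal.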
Exercise 3: get more detailed info not at the session level (overall accuracy on easy stimuli), but at the individual trial level. You can use sessions * behavior.TrialSet.Trial to get this, but be warned that this will become huge/slow quickly. Better to first restrict to a subset of sessions (e.g. from one mouse), or to use .proj to select only those attributes of the TrialSet that you really need. See here for an example.